handouts/ho03.tex
changeset 504 5dc452d7c08e
parent 497 c498cb53a9a8
child 518 aecbe0077f2d
equal deleted inserted replaced
503:f2d7b885b3e3 504:5dc452d7c08e
     3 \usepackage{../langs}
     3 \usepackage{../langs}
     4 \usepackage{../graphics}
     4 \usepackage{../graphics}
     5 
     5 
     6 
     6 
     7 \begin{document}
     7 \begin{document}
     8 
     8 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017}
     9 \section*{Handout 3 (Automata)}
     9 
    10 
    10 \section*{Handout 3 (Finite Automata)}
    11 Every formal language course I know of bombards you first with
    11 
    12 automata and then to a much, much smaller extend with regular
    12 
    13 expressions. As you can see, this course is turned upside
    13 Every formal language and compiler course I know of bombards you first
    14 down: regular expressions come first. The reason is that
    14 with automata and then to a much, much smaller extend with regular
    15 regular expressions are easier to reason about and the notion
    15 expressions. As you can see, this course is turned upside down:
    16 of derivatives, although already quite old, only became more
    16 regular expressions come first. The reason is that regular expressions
    17 widely known rather recently. Still let us in this lecture
    17 are easier to reason about and the notion of derivatives, although
    18 have a closer look at automata and their relation to regular
    18 already quite old, only became more widely known rather
    19 expressions. This will help us with understanding why the
    19 recently. Still, let us in this lecture have a closer look at automata
    20 regular expression matchers in Python, Ruby and Java are so slow
    20 and their relation to regular expressions. This will help us with
    21 with certain regular expressions. The central definition
    21 understanding why the regular expression matchers in Python, Ruby and
    22 is:\medskip 
    22 Java are so slow with certain regular expressions. On the way we will
       
    23 also see what are the limitations of regular
       
    24 expressions. Unfortunately, they cannot be used for \emph{everything}.
       
    25 
       
    26 
       
    27 \subsection*{Deterministic Finite Automata}
       
    28 
       
    29 Lets start\ldots the central definition is:\medskip
    23 
    30 
    24 \noindent 
    31 \noindent 
    25 A \emph{deterministic finite automaton} (DFA), say $A$, is
    32 A \emph{deterministic finite automaton} (DFA), say $A$, is
    26 defined by a four-tuple written $A(Q, q_0, F, \delta)$ where
    33 given by a five-tuple written ${\cal A}(\varSigma, Qs, Q_0, F, \delta)$ where
    27 
    34 
    28 \begin{itemize}
    35 \begin{itemize}
    29 \item $Q$ is a finite set of states,
    36 \item $\varSigma$ is an alphabet,  
    30 \item $q_0 \in Q$ is the start state,
    37 \item $Qs$ is a finite set of states,
    31 \item $F \subseteq Q$ are the accepting states, and
    38 \item $Q_0 \in Qs$ is the start state,
       
    39 \item $F \subseteq Qs$ are the accepting states, and
    32 \item $\delta$ is the transition function.
    40 \item $\delta$ is the transition function.
    33 \end{itemize}
    41 \end{itemize}
    34 
    42 
    35 \noindent The transition function determines how to
    43 \noindent I am sure you have seen this definition already
    36 ``transition'' from one state to the next state with respect
    44 before. The transition function determines how to ``transition'' from
    37 to a character. We have the assumption that these transition
    45 one state to the next state with respect to a character. We have the
    38 functions do not need to be defined everywhere: so it can be
    46 assumption that these transition functions do not need to be defined
    39 the case that given a character there is no next state, in
    47 everywhere: so it can be the case that given a character there is no
    40 which case we need to raise a kind of ``failure exception''. A
    48 next state, in which case we need to raise a kind of ``failure
       
    49 exception''.  That means actually we have \emph{partial} functions as
       
    50 transitions---see the Scala implementation for DFAs later on.  A
    41 typical example of a DFA is
    51 typical example of a DFA is
    42 
    52 
    43 \begin{center}
    53 \begin{center}
    44 \begin{tikzpicture}[>=stealth',very thick,auto,
    54 \begin{tikzpicture}[>=stealth',very thick,auto,
    45                     every state/.style={minimum size=0pt,
    55                     every state/.style={minimum size=0pt,
    46                     inner sep=2pt,draw=blue!50,very thick,
    56                     inner sep=2pt,draw=blue!50,very thick,
    47                     fill=blue!20},scale=2]
    57                     fill=blue!20},scale=2]
    48 \node[state,initial]  (q_0)  {$q_0$};
    58 \node[state,initial]  (Q_0)  {$Q_0$};
    49 \node[state] (q_1) [right=of q_0] {$q_1$};
    59 \node[state] (Q_1) [right=of Q_0] {$Q_1$};
    50 \node[state] (q_2) [below right=of q_0] {$q_2$};
    60 \node[state] (Q_2) [below right=of Q_0] {$Q_2$};
    51 \node[state] (q_3) [right=of q_2] {$q_3$};
    61 \node[state] (Q_3) [right=of Q_2] {$Q_3$};
    52 \node[state, accepting] (q_4) [right=of q_1] {$q_4$};
    62 \node[state, accepting] (Q_4) [right=of Q_1] {$Q_4$};
    53 \path[->] (q_0) edge node [above]  {$a$} (q_1);
    63 \path[->] (Q_0) edge node [above]  {$a$} (Q_1);
    54 \path[->] (q_1) edge node [above]  {$a$} (q_4);
    64 \path[->] (Q_1) edge node [above]  {$a$} (Q_4);
    55 \path[->] (q_4) edge [loop right] node  {$a, b$} ();
    65 \path[->] (Q_4) edge [loop right] node  {$a, b$} ();
    56 \path[->] (q_3) edge node [right]  {$a$} (q_4);
    66 \path[->] (Q_3) edge node [right]  {$a$} (Q_4);
    57 \path[->] (q_2) edge node [above]  {$a$} (q_3);
    67 \path[->] (Q_2) edge node [above]  {$a$} (Q_3);
    58 \path[->] (q_1) edge node [right]  {$b$} (q_2);
    68 \path[->] (Q_1) edge node [right]  {$b$} (Q_2);
    59 \path[->] (q_0) edge node [above]  {$b$} (q_2);
    69 \path[->] (Q_0) edge node [above]  {$b$} (Q_2);
    60 \path[->] (q_2) edge [loop left] node  {$b$} ();
    70 \path[->] (Q_2) edge [loop left] node  {$b$} ();
    61 \path[->] (q_3) edge [bend left=95, looseness=1.3] node [below]  {$b$} (q_0);
    71 \path[->] (Q_3) edge [bend left=95, looseness=1.3] node [below]  {$b$} (Q_0);
    62 \end{tikzpicture}
    72 \end{tikzpicture}
    63 \end{center}
    73 \end{center}
    64 
    74 
    65 \noindent In this graphical notation, the accepting state
    75 \noindent In this graphical notation, the accepting state $Q_4$ is
    66 $q_4$ is indicated with double circles. Note that there can be
    76 indicated with double circles. Note that there can be more than one
    67 more than one accepting state. It is also possible that a DFA
    77 accepting state. It is also possible that a DFA has no accepting
    68 has no accepting states at all, or that the starting state is
    78 state at all, or that the starting state is also an accepting
    69 also an accepting state. In the case above the transition
    79 state. In the case above the transition function is defined everywhere
    70 function is defined everywhere and can be given as a table as
    80 and can also be given as a table as follows:
    71 follows:
       
    72 
    81 
    73 \[
    82 \[
    74 \begin{array}{lcl}
    83 \begin{array}{lcl}
    75 (q_0, a) &\rightarrow& q_1\\
    84 (Q_0, a) &\rightarrow& Q_1\\
    76 (q_0, b) &\rightarrow& q_2\\
    85 (Q_0, b) &\rightarrow& Q_2\\
    77 (q_1, a) &\rightarrow& q_4\\
    86 (Q_1, a) &\rightarrow& Q_4\\
    78 (q_1, b) &\rightarrow& q_2\\
    87 (Q_1, b) &\rightarrow& Q_2\\
    79 (q_2, a) &\rightarrow& q_3\\
    88 (Q_2, a) &\rightarrow& Q_3\\
    80 (q_2, b) &\rightarrow& q_2\\
    89 (Q_2, b) &\rightarrow& Q_2\\
    81 (q_3, a) &\rightarrow& q_4\\
    90 (Q_3, a) &\rightarrow& Q_4\\
    82 (q_3, b) &\rightarrow& q_0\\
    91 (Q_3, b) &\rightarrow& Q_0\\
    83 (q_4, a) &\rightarrow& q_4\\
    92 (Q_4, a) &\rightarrow& Q_4\\
    84 (q_4, b) &\rightarrow& q_4\\
    93 (Q_4, b) &\rightarrow& Q_4\\
    85 \end{array}
    94 \end{array}
    86 \]
    95 \]
       
    96 
       
    97 \noindent
       
    98 Please check that this table represents the same transition function
       
    99 as the graph above.
    87 
   100 
    88 We need to define the notion of what language is accepted by
   101 We need to define the notion of what language is accepted by
    89 an automaton. For this we lift the transition function
   102 an automaton. For this we lift the transition function
    90 $\delta$ from characters to strings as follows:
   103 $\delta$ from characters to strings as follows:
    91 
   104 
    92 \[
   105 \[
    93 \begin{array}{lcl}
   106 \begin{array}{lcl}
    94 \hat{\delta}(q, [])        & \dn & q\\
   107 \widehat{\delta}(q, [])        & \dn & q\\
    95 \hat{\delta}(q, c\!::\!s) & \dn & \hat{\delta}(\delta(q, c), s)\\
   108 \widehat{\delta}(q, c\!::\!s) & \dn & \widehat{\delta}(\delta(q, c), s)\\
    96 \end{array}
   109 \end{array}
    97 \]
   110 \]
    98 
   111 
    99 \noindent This lifted transition function is often called
   112 \noindent This lifted transition function is often called
   100 ``delta-hat''. Given a string, we start in the starting state
   113 \emph{delta-hat}. Given a string, we start in the starting state and
   101 and take the first character of the string, follow to the next
   114 take the first character of the string, follow to the next state, then
   102 sate, then take the second character and so on. Once the
   115 take the second character and so on. Once the string is exhausted and
   103 string is exhausted and we end up in an accepting state, then
   116 we end up in an accepting state, then this string is accepted by the
   104 this string is accepted by the automaton. Otherwise it is not
   117 automaton. Otherwise it is not accepted. This also means that if along
   105 accepted. So $s$ is in the \emph{language accepted by the
   118 the way we hit the case where the transition function $\delta$ is not
   106 automaton} $A(Q, q_0, F, \delta)$ iff
   119 defined, we need to raise an error. In our implementation we will deal
       
   120 with this case elegantly by using Scala's \texttt{Try}.  Summing up: a
       
   121 string $s$ is in the \emph{language accepted by the automaton} ${\cal
       
   122   A}(\varSigma, Q, Q_0, F, \delta)$ iff
   107 
   123 
   108 \[
   124 \[
   109 \hat{\delta}(q_0, s) \in F 
   125 \widehat{\delta}(Q_0, s) \in F 
   110 \]
   126 \]
   111 
   127 
   112 \noindent I let you think about a definition that describes
   128 \noindent I let you think about a definition that describes the set of
   113 the set of strings accepted by an automaton.
   129 all strings accepted by a deterministic finite automaton.
   114   
   130 
   115 
   131 \begin{figure}[p]
   116 While with DFAs it will always be clear that given a character
   132 \small
   117 what the next state is (potentially none), it will be useful
   133 \lstinputlisting[numbers=left]{../progs/display/dfa.scala}
   118 to relax this restriction. That means we have several
   134 \caption{A Scala implementation of DFAs using partial functions.
   119 potential successor states. We even allow ``silent
   135   Note some subtleties: \texttt{deltas} implements the delta-hat
   120 transitions'', also called epsilon-transitions. They allow us
   136   construction by lifting the (partial) transition  function to lists
   121 to go from one state to the next without having a character
   137   of characters. Since \texttt{delta} is given as a partial function,
   122 consumed. We label such silent transition with the letter
   138   it can obviously go ``wrong'' in which case the \texttt{Try} in
   123 $\epsilon$. The resulting construction is called a
   139   \texttt{accepts} catches the error and returns \texttt{false}---that
   124 \emph{non-deterministic finite automaton} (NFA) given also as
   140   means the string is not accepted.  The example \texttt{delta} in
   125 a four-tuple $A(Q, q_0, F, \rho)$ where
   141   Line 28--38 implements the DFA example shown earlier in the
       
   142   handout.\label{dfa}}
       
   143 \end{figure}
       
   144 
       
   145 My take on an implementation of DFAs in Scala is given in
       
   146 Figure~\ref{dfa}. As you can see, there are many features of the
       
   147 mathematical definition that are quite closely reflected in the
       
   148 code. In the DFA-class, there is a starting state, called
       
   149 \texttt{start}, with the polymorphic type \texttt{A}.  There is a
       
   150 partial function \texttt{delta} for specifying the transitions---these
       
   151 partial functions take a state (of polymorphic type \texttt{A}) and an
       
   152 input (of polymorphic type \texttt{C}) and produce a new state (of
       
   153 type \texttt{A}). For the moment it is OK to assume that \texttt{A} is
       
   154 some arbitrary type for states and the input is just characters.  (The
       
   155 reason for not having concrete types, but polymorphic types for the
       
   156 states and the input of DFAs will become clearer later on.)
       
   157 
       
   158 The DFA-class has also an argument for specifying final states. In the
       
   159 implementation it is not a set of states, as in the mathematical
       
   160 definition, but a function from states to booleans (this function is
       
   161 supposed to return true whenever a state is final; false
       
   162 otherwise). While this boolean function is different from the sets of
       
   163 states, Scala allows to use sets for such functions (see Line 40 where
       
   164 the DFA is initialised). Again it will become clear later on why I use
       
   165 functions for final states, rather than sets.
       
   166 
       
   167 The most important point in the implementation is that I use Scala's
       
   168 partial functions for representing the transitions; alternatives would
       
   169 have been \texttt{Maps} or even \texttt{Lists}. One of the main
       
   170 advantages of using partial functions is that transitions can be quite
       
   171 nicely defined by a series of \texttt{case} statements (see Lines 28
       
   172 -- 38 for an example). If you need to represent an automaton with a
       
   173 sink state (catch-all-state), you can use Scala's pattern matching and
       
   174 write something like
       
   175 
       
   176 {\small\begin{lstlisting}[language=Scala]
       
   177 abstract class State
       
   178 ...
       
   179 case object Sink extends State
       
   180 
       
   181 val delta : (State, Char) :=> State = 
       
   182   { case (S0, 'a') => S1
       
   183     case (S1, 'a') => S2
       
   184     case _ => Sink
       
   185   }
       
   186 \end{lstlisting}}  
       
   187 
       
   188 \noindent I let you think what the corresponding DFA looks like in the
       
   189 graph notation. Also, I suggest you to tinker with the Scala code in
       
   190 order to define the DFA that does not accept any string at all. 
       
   191 
       
   192 Finally, I let you ponder whether this is a good implementation of
       
   193 DFAs or not. In doing so I hope you notice that the $\varSigma$ and
       
   194 $Qs$ components (the alphabet and the set of finite states,
       
   195 respectively) are missing from the class definition. This means that
       
   196 the implementation allows you to do some fishy things you are not
       
   197 meant to do with DFAs. Which fishy things could that be?
       
   198 
       
   199 
       
   200 
       
   201 \subsection*{Non-Deterministic Finite Automata}
       
   202 
       
   203 Remember we want to find out what the relation is between regular
       
   204 expressions and automata. To do this with DFAs is a bit unwieldy.
       
   205 While with DFAs it is always clear that given a state and a character
       
   206 what the next state is (potentially none), it will be convenient to
       
   207 relax this restriction. That means we allow states to have several
       
   208 potential successor states. We even allow more than one starting
       
   209 state. The resulting construction is called a \emph{Non-Deterministic
       
   210   Finite Automaton} (NFA) given also as a five-tuple ${\cal
       
   211   A}(\varSigma, Qs, Q_{0s}, F, \rho)$ where
   126 
   212 
   127 \begin{itemize}
   213 \begin{itemize}
   128 \item $Q$ is a finite set of states
   214 \item $\varSigma$ is an alphabet,  
   129 \item $q_0$ is a start state
   215 \item $Qs$ is a finite set of states
   130 \item $F$ are some accepting states with $F \subseteq Q$, and
   216 \item $Q_{0s}$ is a set of start states ($Q_{0s} \subseteq Qs$)
       
   217 \item $F$ are some accepting states with $F \subseteq Qs$, and
   131 \item $\rho$ is a transition relation.
   218 \item $\rho$ is a transition relation.
   132 \end{itemize}
   219 \end{itemize}
   133 
   220 
   134 \noindent
   221 \noindent
   135 Two typical examples of NFAs are
   222 A typical example of a NFA is
       
   223 
       
   224 % A NFA for (ab* + b)*a
       
   225 \begin{center}
       
   226 \begin{tikzpicture}[>=stealth',very thick, auto,
       
   227     every state/.style={minimum size=0pt,inner sep=3pt,
       
   228       draw=blue!50,very thick,fill=blue!20},scale=2]
       
   229 \node[state,initial]  (Q_0)  {$Q_0$};
       
   230 \node[state] (Q_1) [right=of Q_0] {$Q_1$};
       
   231 \node[state, accepting] (Q_2) [right=of Q_1] {$Q_2$};
       
   232 \path[->] (Q_0) edge [loop above] node  {$b$} ();
       
   233 \path[<-] (Q_0) edge node [below]  {$b$} (Q_1);
       
   234 \path[->] (Q_0) edge [bend left] node [above]  {$a$} (Q_1);
       
   235 \path[->] (Q_0) edge [bend right] node [below]  {$a$} (Q_2);
       
   236 \path[->] (Q_1) edge [loop above] node  {$a,b$} ();
       
   237 \path[->] (Q_1) edge node  [above] {$a$} (Q_2);
       
   238 \end{tikzpicture}
       
   239 \end{center}
       
   240 
       
   241 \noindent
       
   242 This NFA happens to have only one starting state, but in general there
       
   243 could be more than one.  Notice that in state $Q_0$ we might go to
       
   244 state $Q_1$ \emph{or} to state $Q_2$ when receiving an $a$. Similarly
       
   245 in state $Q_1$ and receiving an $a$, we can stay in state $Q_1$
       
   246 \emph{or} go to $Q_2$. This kind of choice is not allowed with
       
   247 DFAs. The downside of this choice in NFAs is that when it comes to
       
   248 deciding whether a string is accepted by a NFA we potentially have to
       
   249 explore all possibilities. I let you think which strings the above NFA
       
   250 accepts.
       
   251 
       
   252 
       
   253 There are a number of additional points you should note about
       
   254 NFAs. Every DFA is a NFA, but not vice versa. The $\rho$ in NFAs is a
       
   255 transition \emph{relation} (DFAs have a transition function). The
       
   256 difference between a function and a relation is that a function has
       
   257 always a single output, while a relation gives, roughly speaking,
       
   258 several outputs. Look again at the NFA above: if you are currently in
       
   259 the state $Q_1$ and you read a character $b$, then you can transition
       
   260 to either $Q_0$ \emph{or} $Q_2$. Which route, or output, you take is
       
   261 not determined.  This non-determinism can be represented by a
       
   262 relation.
       
   263 
       
   264 My implementation of NFAs in Scala is shown in Figure~\ref{nfa}.
       
   265 Perhaps interestingly, I do not actually use relations for my NFAs,
       
   266 but use transition functions that return sets of states.  DFAs have
       
   267 partial transition functions that return a single state; my NFAs
       
   268 return a set of states. I let you think about this representation for
       
   269 NFA-transitions and how it corresponds to the relations used in the
       
   270 mathematical definition of NFAs. An example of a transition function
       
   271 in Scala for the NFA shown above is
       
   272 
       
   273 {\small\begin{lstlisting}[language=Scala]
       
   274 val nfa_delta : (State, Char) :=> Set[State] = 
       
   275   { case (Q0, 'a') => Set(Q1, Q2)
       
   276     case (Q0, 'b') => Set(Q0)
       
   277     case (Q1, 'a') => Set(Q1, Q2)
       
   278     case (Q1, 'b') => Set(Q0, Q1) }
       
   279 \end{lstlisting}}  
       
   280 
       
   281 Like in the mathematical definition, \texttt{starts} is in
       
   282 NFAs a set of states; \texttt{fins} is again a function from states to
       
   283 booleans. The \texttt{next} function calculates the set of next states
       
   284 reachable from a single state \texttt{q} by a character~\texttt{c}. In
       
   285 case there is no such state---the partial transition function is
       
   286 undefined---the empty set is returned (see function
       
   287 \texttt{applyOrElse} in Lines 9 and 10). The function \texttt{nexts}
       
   288 just lifts this function to sets of states.
       
   289  
       
   290 \begin{figure}[p]
       
   291 \small
       
   292 \lstinputlisting[numbers=left]{../progs/display/nfa.scala}
       
   293 \caption{A Scala implementation of NFAs using partial functions.
       
   294   Notice that the function \texttt{accepts} implements the
       
   295   acceptance of a string in a breath-first search fashion. This can be a costly
       
   296   way of deciding whether a string is accepted or not in applications that need to handle
       
   297   large NFAs and large inputs.\label{nfa}}
       
   298 \end{figure}
       
   299 
       
   300 Look very careful at the \texttt{accepts} and \texttt{deltas}
       
   301 functions in NFAs and remember that when accepting a string by a NFA
       
   302 we might have to explore all possible transitions (recall which state
       
   303 to go to is not unique anymore with NFAs\ldots{}we need to explore
       
   304 potentially all next states). The implementation achieves this
       
   305 exploration through a \emph{breadth-first search}. This is fine for
       
   306 small NFAs, but can lead to real memory problems when the NFAs are
       
   307 bigger and larger strings need to be processed. As result, some
       
   308 regular expression matching engines resort to a \emph{depth-first
       
   309   search} with \emph{backtracking} in unsuccessful cases. In our
       
   310 implementation we can implement a depth-first version of
       
   311 \texttt{accepts} using Scala's \texttt{exists}-function as follows:
       
   312 
       
   313 
       
   314 {\small\begin{lstlisting}[language=Scala]
       
   315 def search(q: A, s: List[C]) : Boolean = s match {
       
   316   case Nil => fins(q)
       
   317   case c::cs => next(q, c).exists(search(_, cs)) 
       
   318 }
       
   319 
       
   320 def accepts2(s: List[C]) : Boolean = 
       
   321   starts.exists(search(_, s))
       
   322 \end{lstlisting}}
       
   323 
       
   324 \noindent
       
   325 This depth-first way of exploration seems to work quite efficiently in
       
   326 many examples and is much less of a strain on memory. The problem is
       
   327 that the backtracking can get ``catastrophic'' in some
       
   328 examples---remember the catastrophic backtracking from earlier
       
   329 lectures. This depth-first search with backtracking is the reason for
       
   330 the abysmal performance of some regular expression matchings in Java,
       
   331 Ruby and Python. I like to show you this in the next two sections.
       
   332 
       
   333 
       
   334 \subsection*{Epsilon NFAs}
       
   335 
       
   336 In order to get an idea what calculations are performed by Java \&
       
   337 friends, we need a method for transforming a regular expression into
       
   338 an automaton. This automaton should accept exactly those strings that
       
   339 are accepted by the regular expression.  The simplest and most
       
   340 well-known method for this is called \emph{Thompson Construction},
       
   341 after the Turing Award winner Ken Thompson. This method is by
       
   342 recursion over regular expressions and depends on the non-determinism
       
   343 in NFAs described in the previous section. You will see shortly why
       
   344 this construction works well with NFAs, but is not so straightforward
       
   345 with DFAs.
       
   346 
       
   347 Unfortunately we are still one step away from our intended target
       
   348 though---because this construction uses a version of NFAs that allows
       
   349 ``silent transitions''. The idea behind silent transitions is that
       
   350 they allow us to go from one state to the next without having to
       
   351 consume a character. We label such silent transition with the letter
       
   352 $\epsilon$ and call the automata $\epsilon$NFAs. Two typical examples
       
   353 of $\epsilon$NFAs are:
       
   354 
       
   355 
   136 \begin{center}
   356 \begin{center}
   137 \begin{tabular}[t]{c@{\hspace{9mm}}c}
   357 \begin{tabular}[t]{c@{\hspace{9mm}}c}
   138 \begin{tikzpicture}[scale=0.7,>=stealth',very thick,
   358 \begin{tikzpicture}[>=stealth',very thick,
   139                              every state/.style={minimum size=0pt,draw=blue!50,very thick,fill=blue!20},]
   359     every state/.style={minimum size=0pt,draw=blue!50,very thick,fill=blue!20},]
   140 \node[state,initial]  (q_0)  {$q_0$};
   360 \node[state,initial]  (Q_0)  {$Q_0$};
   141 \node[state] (q_1) [above=of q_0] {$q_1$};
   361 \node[state] (Q_1) [above=of Q_0] {$Q_1$};
   142 \node[state, accepting] (q_2) [below=of q_0] {$q_2$};
   362 \node[state, accepting] (Q_2) [below=of Q_0] {$Q_2$};
   143 \path[->] (q_0) edge node [left]  {$\epsilon$} (q_1);
   363 \path[->] (Q_0) edge node [left]  {$\epsilon$} (Q_1);
   144 \path[->] (q_0) edge node [left]  {$\epsilon$} (q_2);
   364 \path[->] (Q_0) edge node [left]  {$\epsilon$} (Q_2);
   145 \path[->] (q_0) edge [loop right] node  {$a$} ();
   365 \path[->] (Q_0) edge [loop right] node  {$a$} ();
   146 \path[->] (q_1) edge [loop above] node  {$a$} ();
   366 \path[->] (Q_1) edge [loop right] node  {$a$} ();
   147 \path[->] (q_2) edge [loop below] node  {$b$} ();
   367 \path[->] (Q_2) edge [loop right] node  {$b$} ();
   148 \end{tikzpicture} &
   368 \end{tikzpicture} &
   149 
   369 
   150 \raisebox{20mm}{
   370 \raisebox{20mm}{
   151 \begin{tikzpicture}[scale=0.7,>=stealth',very thick,
   371 \begin{tikzpicture}[>=stealth',very thick,
   152                              every state/.style={minimum size=0pt,draw=blue!50,very thick,fill=blue!20},]
   372     every state/.style={minimum size=0pt,draw=blue!50,very thick,fill=blue!20},]
   153 \node[state,initial]  (r_1)  {$r_1$};
   373 \node[state,initial]  (r_1)  {$R_1$};
   154 \node[state] (r_2) [above=of r_1] {$r_2$};
   374 \node[state] (r_2) [above=of r_1] {$R_2$};
   155 \node[state, accepting] (r_3) [right=of r_1] {$r_3$};
   375 \node[state, accepting] (r_3) [right=of r_1] {$R_3$};
   156 \path[->] (r_1) edge node [below]  {$b$} (r_3);
   376 \path[->] (r_1) edge node [below]  {$b$} (r_3);
   157 \path[->] (r_2) edge [bend left] node [above]  {$a$} (r_3);
   377 \path[->] (r_2) edge [bend left] node [above]  {$a$} (r_3);
   158 \path[->] (r_1) edge [bend left] node  [left] {$\epsilon$} (r_2);
   378 \path[->] (r_1) edge [bend left] node  [left] {$\epsilon$} (r_2);
   159 \path[->] (r_2) edge [bend left] node  [right] {$a$} (r_1);
   379 \path[->] (r_2) edge [bend left] node  [right] {$a$} (r_1);
   160 \end{tikzpicture}}
   380 \end{tikzpicture}}
   161 \end{tabular}
   381 \end{tabular}
   162 \end{center}
   382 \end{center}
   163 
   383 
   164 \noindent There are, however, a number of points you should
   384 \noindent
   165 note. Every DFA is a NFA, but not vice versa. The $\rho$ in
   385 Consider the $\epsilon$NFA on the left-hand side: the
   166 NFAs is a transition \emph{relation} (DFAs have a transition
   386 $\epsilon$-transitions mean you do not have to ``consume'' any part of
   167 function). The difference between a function and a relation is
   387 the input string, but ``silently'' change to a different state. In
   168 that a function has always a single output, while a relation
   388 this example, if you are in the starting state $Q_0$, you can silently
   169 gives, roughly speaking, several outputs. Look at the NFA on
   389 move either to $Q_1$ or $Q_2$. You can see that once you are in $Q_1$,
   170 the right-hand side above: if you are currently in the state
   390 respectively $Q_2$, you cannot ``go back'' to the other states. So it
   171 $r_2$ and you read a character $a$, then you can transition to
   391 seems allowing $\epsilon$-transitions is a rather substantial
   172 either $r_1$ \emph{or} $r_3$. Which route you take is not
   392 extension to NFAs. On first appearances, $\epsilon$-transitions might
   173 determined. This means if we need to decide whether a string
   393 even look rather strange, or even silly. To start with, silent
   174 is accepted by a NFA, we might have to explore all
   394 transitions make the decision whether a string is accepted by an
   175 possibilities. Also there is the special silent transition in
   395 automaton even harder: with $\epsilon$NFAs we have to look whether we
   176 NFAs. As mentioned already this transition means you do not
   396 can do first some $\epsilon$-transitions and then do a
   177 have to ``consume'' any part of the input string, but
   397 ``proper''-transition; and after any ``proper''-transition we again
   178 ``silently'' change to a different state. In the left picture,
   398 have to check whether we can do again some silent transitions. Even
   179 for example, if you are in the starting state, you can 
   399 worse, if there is a silent transition pointing back to the same
   180 silently move either to $q_1$ or $q_2$. This silent
   400 state, then we have to be careful our decision procedure for strings
   181 transition is also often called \emph{$\epsilon$-transition}.
   401 does not loop (remember the depth-first search for exploring all
   182 
   402 states).
   183 
   403 
   184 \subsubsection*{Thompson Construction}
   404 The obvious question is: Do we get anything in return for this hassle
   185 
   405 with silent transitions?  Well, we still have to work for it\ldots
   186 The reason for introducing NFAs is that there is a relatively
   406 unfortunately.  If we were to follow the many textbooks on the
   187 simple (recursive) translation of regular expressions into
   407 subject, we would now start with defining what $\epsilon$NFAs
   188 NFAs. Consider the simple regular expressions $\ZERO$,
   408 are---that would require extending the transition relation of
   189 $\ONE$ and $c$. They can be translated as follows:
   409 NFAs. Next, we would show that the $\epsilon$NFAs are equivalent to
   190 
   410 NFAs and so on. Once we have done all this on paper, we would need to
   191 \begin{center}
   411 implement $\epsilon$NFAs. Lets try to take a shortcut instead. We are
       
   412 not really interested in $\epsilon$NFAs; they are only a convenient
       
   413 tool for translating regular expressions into automata. So we are not
       
   414 going to implementing them explicitly, but translate them immediately
       
   415 into NFAs (in a sense $\epsilon$NFAs are just a convenient API for
       
   416 lazy people ;o).  How does the translation work? Well we have to find
       
   417 all transitions of the form
       
   418 
       
   419 \[
       
   420 q\stackrel{\epsilon}{\longrightarrow}\ldots\stackrel{\epsilon}{\longrightarrow}
       
   421 \;\stackrel{a}{\longrightarrow}\;
       
   422 \stackrel{\epsilon}{\longrightarrow}\ldots\stackrel{\epsilon}{\longrightarrow} q'
       
   423 \]
       
   424 
       
   425 \noindent where somewhere in the ``middle'' is an $a$-transition. We
       
   426 replace them with $q \stackrel{a}{\longrightarrow} q'$. Doing this to
       
   427 the $\epsilon$NFA on the right-hand side above gives the NFA
       
   428 
       
   429 \begin{center}
       
   430 \begin{tikzpicture}[>=stealth',very thick,
       
   431     every state/.style={minimum size=0pt,draw=blue!50,very thick,fill=blue!20},]
       
   432 \node[state,initial]  (r_1)  {$R_1$};
       
   433 \node[state] (r_2) [above=of r_1] {$R_2$};
       
   434 \node[state, accepting] (r_3) [right=of r_1] {$R_3$};
       
   435 \path[->] (r_1) edge node [above]  {$b$} (r_3);
       
   436 \path[->] (r_2) edge [bend left] node [above]  {$a$} (r_3);
       
   437 \path[->] (r_1) edge [bend left] node  [left] {$a$} (r_2);
       
   438 \path[->] (r_2) edge [bend left] node  [right] {$a$} (r_1);
       
   439 \path[->] (r_1) edge [loop below] node  {$a$} ();
       
   440 \path[->] (r_1) edge [bend right] node [below]  {$a$} (r_3);
       
   441 \end{tikzpicture}
       
   442 \end{center}
       
   443 
       
   444 \noindent where the single $\epsilon$-transition is replaced by
       
   445 three additional $a$-transitions. Please do the calculations yourself
       
   446 and verify that I did not forget any transition.
       
   447 
       
   448 So in what follows, whenever we are given an $\epsilon$NFA we will
       
   449 replace it by an equivalent NFA. The Scala code for this translation
       
   450 is given in Figure~\ref{enfa}. The main workhorse in this code is a
       
   451 function that calculates a fixpoint of function (Lines 5--10). This
       
   452 function is used for ``discovering'' which states are reachable by
       
   453 $\epsilon$-transitions. Once no new state is discovered, a fixpoint is
       
   454 reached. This is used for example when calculating the starting states
       
   455 of an equivalent NFA (see Line 36): we start with all starting states
       
   456 of the $\epsilon$NFA and then look for all additional states that can
       
   457 be reached by $\epsilon$-transitions. We keep on doing this until no
       
   458 new state can be reached. This is what the $\epsilon$-closure, named
       
   459 in the code \texttt{ecl}, calculates. Similarly, an accepting state of
       
   460 the NFA is when we can reach an accepting state of the $\epsilon$NFA
       
   461 by $\epsilon$-transitions.
       
   462 
       
   463 
       
   464 \begin{figure}[p]
       
   465 \small
       
   466 \lstinputlisting[numbers=left]{../progs/display/enfa.scala}
       
   467 
       
   468 \caption{A Scala function that translates $\epsilon$NFA into NFAs. The
       
   469   transition function of $\epsilon$NFA takes as input an \texttt{Option[C]}.
       
   470   \texttt{None} stands for an $\epsilon$-transition; \texttt{Some(c)}
       
   471   for a ``proper'' transition consuming a character. The functions in
       
   472   Lines 18--26 calculate
       
   473   all states reachable by one or more $\epsilon$-transition for a given
       
   474   set of states. The NFA is constructed in Lines 36--38.
       
   475   Note the interesting commands in Lines 5 and 6: their purpose is
       
   476   to ensure that \texttt{fixpT} is the tail-recursive version of
       
   477   the fixpoint construction; otherwise we would quickly get a
       
   478   stack-overflow exception, even on small examples, due to limitations
       
   479   of the JVM.
       
   480   \label{enfa}}
       
   481 \end{figure}
       
   482 
       
   483 Also look carefully how the transitions of $\epsilon$NFAs are
       
   484 implemented. The additional possibility of performing silent
       
   485 transitions is encoded by using \texttt{Option[C]} as the type for the
       
   486 ``input''. The \texttt{Some}s stand for ``proper'' transitions where
       
   487 a character is consumed; \texttt{None} stands for
       
   488 $\epsilon$-transitions. The transition functions for the two
       
   489 $\epsilon$NFAs from the beginning of this section can be defined as
       
   490 
       
   491 {\small\begin{lstlisting}[language=Scala]
       
   492 val enfa_trans1 : (State, Option[Char]) :=> Set[State] = 
       
   493   { case (Q0, Some('a')) => Set(Q0)
       
   494     case (Q0, None) => Set(Q1, Q2)
       
   495     case (Q1, Some('a')) => Set(Q1)
       
   496     case (Q2, Some('b')) => Set(Q2) }
       
   497 
       
   498 val enfa_trans2 : (State, Option[Char]) :=> Set[State] = 
       
   499   { case (R1, Some('b')) => Set(R3)
       
   500     case (R1, None) => Set(R2)
       
   501     case (R2, Some('a')) => Set(R1, R3) }
       
   502 \end{lstlisting}}
       
   503 
       
   504 \noindent
       
   505 I hope you agree now with my earlier statement that the $\epsilon$NFAs
       
   506 are just an API for NFAs.
       
   507 
       
   508 \subsection*{Thompson Construction}
       
   509 
       
   510 Having the translation of $\epsilon$NFAs to NFAs in place, we can
       
   511 finally return to the problem of translating regular expressions into
       
   512 equivalent NFAs. Recall that by equivalent we mean that the NFAs
       
   513 recognise the same language. Consider the simple regular expressions
       
   514 $\ZERO$, $\ONE$ and $c$. They can be translated into equivalent NFAs
       
   515 as follows:
       
   516 
       
   517 \begin{equation}\mbox{
   192 \begin{tabular}[t]{l@{\hspace{10mm}}l}
   518 \begin{tabular}[t]{l@{\hspace{10mm}}l}
   193 \raisebox{1mm}{$\ZERO$} & 
   519 \raisebox{1mm}{$\ZERO$} & 
   194 \begin{tikzpicture}[scale=0.7,>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
   520 \begin{tikzpicture}[scale=0.7,>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
   195 \node[state, initial]  (q_0)  {$\mbox{}$};
   521 \node[state, initial]  (Q_0)  {$\mbox{}$};
   196 \end{tikzpicture}\\\\
   522 \end{tikzpicture}\\\\
   197 \raisebox{1mm}{$\ONE$} & 
   523 \raisebox{1mm}{$\ONE$} & 
   198 \begin{tikzpicture}[scale=0.7,>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
   524 \begin{tikzpicture}[scale=0.7,>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
   199 \node[state, initial, accepting]  (q_0)  {$\mbox{}$};
   525 \node[state, initial, accepting]  (Q_0)  {$\mbox{}$};
   200 \end{tikzpicture}\\\\
   526 \end{tikzpicture}\\\\
   201 \raisebox{2mm}{$c$} & 
   527 \raisebox{3mm}{$c$} & 
   202 \begin{tikzpicture}[scale=0.7,>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
   528 \begin{tikzpicture}[scale=0.7,>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
   203 \node[state, initial]  (q_0)  {$\mbox{}$};
   529 \node[state, initial]  (Q_0)  {$\mbox{}$};
   204 \node[state, accepting]  (q_1)  [right=of q_0] {$\mbox{}$};
   530 \node[state, accepting]  (Q_1)  [right=of Q_0] {$\mbox{}$};
   205 \path[->] (q_0) edge node [below]  {$c$} (q_1);
   531 \path[->] (Q_0) edge node [below]  {$c$} (Q_1);
   206 \end{tikzpicture}\\\\
   532 \end{tikzpicture}\\
   207 \end{tabular}
   533 \end{tabular}}\label{simplecases}
   208 \end{center}
   534 \end{equation}
   209 
   535 
   210 \noindent The case for the sequence regular expression $r_1
   536 \noindent
   211 \cdot r_2$ is as follows: We are given by recursion two
   537 I let you think whether the NFAs can match exactly those strings the
   212 automata representing $r_1$ and $r_2$ respectively. 
   538 regular expressions can match. To do this translation in code we need
       
   539 a way to construct states ``programatically''...and as an additional
       
   540 constraint Scala needs to recognise that these states are being distinct.
       
   541 For this I implemented in Figure~\ref{thompson1} a class
       
   542 \texttt{TState} that includes a counter and a companion object that
       
   543 increases this counter whenever a new state is created.\footnote{You might
       
   544   have to read up what \emph{companion objects} do in Scala.}
       
   545 
       
   546 \begin{figure}[p]
       
   547 \small
       
   548 \lstinputlisting[numbers=left]{../progs/display/thompson1.scala}
       
   549 \caption{The first part of the Thompson Construction. Lines 7--16
       
   550   implement a way of how to create new states that are all
       
   551   distinct by virtue of a counter. This counter is
       
   552   increased in the companion object of \texttt{TState}
       
   553   whenever a new state is created. The code in Lines 24--40
       
   554   constructs NFAs for the simple regular expressions $\ZERO$, $\ONE$ and $c$.
       
   555   Compare this code with the pictures given in \eqref{simplecases} on
       
   556   Page~\pageref{simplecases}.
       
   557   \label{thompson1}}
       
   558 \end{figure}
       
   559 
       
   560 \begin{figure}[p]
       
   561 \small
       
   562 \lstinputlisting[numbers=left]{../progs/display/thompson2.scala}
       
   563 \caption{The second part of the Thompson Construction implementing
       
   564   the composition of NFAs according to $\cdot$, $+$ and ${}^*$.
       
   565   The implicit class about rich partial functions
       
   566   implements the infix operation \texttt{+++} which
       
   567   combines an $\epsilon$NFA transition with a NFA transition
       
   568   (both are given as partial functions---but with different type!).\label{thompson2}}
       
   569 \end{figure}
       
   570 
       
   571 The case for the sequence regular expression $r_1 \cdot r_2$ is a bit more
       
   572 complicated: Say, we are given by recursion two NFAs representing the regular
       
   573 expressions $r_1$ and $r_2$ respectively.
   213 
   574 
   214 \begin{center}
   575 \begin{center}
   215 \begin{tikzpicture}[node distance=3mm,
   576 \begin{tikzpicture}[node distance=3mm,
   216                              >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
   577     >=stealth',very thick,
   217 \node[state, initial]  (q_0)  {$\mbox{}$};
   578     every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
   218 \node (r_1)  [right=of q_0] {$\ldots$};
   579 \node[state, initial]  (Q_0)  {$\mbox{}$};
   219 \node[state, accepting]  (t_1)  [right=of r_1] {$\mbox{}$};
   580 \node[state, initial]  (Q_01) [below=1mm of Q_0] {$\mbox{}$};  
   220 \node[state, accepting]  (t_2)  [above=of t_1] {$\mbox{}$};
   581 \node[state, initial]  (Q_02) [above=1mm of Q_0] {$\mbox{}$};  
   221 \node[state, accepting]  (t_3)  [below=of t_1] {$\mbox{}$};
   582 \node (R_1)  [right=of Q_0] {$\ldots$};
   222 \node[state, initial]  (a_0)  [right=2.5cm of t_1] {$\mbox{}$};
   583 \node[state, accepting]  (T_1)  [right=of R_1] {$\mbox{}$};
   223 \node (b_1)  [right=of a_0] {$\ldots$};
   584 \node[state, accepting]  (T_2)  [above=of T_1] {$\mbox{}$};
       
   585 \node[state, accepting]  (T_3)  [below=of T_1] {$\mbox{}$};
       
   586 
       
   587 \node (A_0)  [right=2.5cm of T_1] {$\mbox{}$};
       
   588 \node[state, initial]  (A_01)  [above=1mm of A_0] {$\mbox{}$};
       
   589 \node[state, initial]  (A_02)  [below=1mm of A_0] {$\mbox{}$};
       
   590 
       
   591 \node (b_1)  [right=of A_0] {$\ldots$};
   224 \node[state, accepting]  (c_1)  [right=of b_1] {$\mbox{}$};
   592 \node[state, accepting]  (c_1)  [right=of b_1] {$\mbox{}$};
   225 \node[state, accepting]  (c_2)  [above=of c_1] {$\mbox{}$};
   593 \node[state, accepting]  (c_2)  [above=of c_1] {$\mbox{}$};
   226 \node[state, accepting]  (c_3)  [below=of c_1] {$\mbox{}$};
   594 \node[state, accepting]  (c_3)  [below=of c_1] {$\mbox{}$};
   227 \begin{pgfonlayer}{background}
   595 \begin{pgfonlayer}{background}
   228 \node (1) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (q_0) (r_1) (t_1) (t_2) (t_3)] {};
   596   \node (1) [rounded corners, inner sep=1mm, thick,
   229 \node (2) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (a_0) (b_1) (c_1) (c_2) (c_3)] {};
   597     draw=black!60, fill=black!20, fit= (Q_0) (R_1) (T_1) (T_2) (T_3)] {};
       
   598   \node (2) [rounded corners, inner sep=1mm, thick,
       
   599     draw=black!60, fill=black!20, fit= (A_0) (b_1) (c_1) (c_2) (c_3)] {};
   230 \node [yshift=2mm] at (1.north) {$r_1$};
   600 \node [yshift=2mm] at (1.north) {$r_1$};
   231 \node [yshift=2mm] at (2.north) {$r_2$};
   601 \node [yshift=2mm] at (2.north) {$r_2$};
   232 \end{pgfonlayer}
   602 \end{pgfonlayer}
   233 \end{tikzpicture}
   603 \end{tikzpicture}
   234 \end{center}
   604 \end{center}
   235 
   605 
   236 \noindent The first automaton has some accepting states. We
   606 \noindent The first NFA has some accepting states and the second some
   237 obtain an automaton for $r_1\cdot r_2$ by connecting these
   607 starting states. We obtain an $\epsilon$NFA for $r_1\cdot r_2$ by
   238 accepting states with $\epsilon$-transitions to the starting
   608 connecting the accepting states of the first NFA with
   239 state of the second automaton. By doing so we make them
   609 $\epsilon$-transitions to the starting states of the second
   240 non-accepting like so:
   610 automaton. By doing so we make the accepting states of the first NFA
       
   611 to be non-accepting like so:
   241 
   612 
   242 \begin{center}
   613 \begin{center}
   243 \begin{tikzpicture}[node distance=3mm,
   614 \begin{tikzpicture}[node distance=3mm,
   244                              >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
   615     >=stealth',very thick,
   245 \node[state, initial]  (q_0)  {$\mbox{}$};
   616     every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
   246 \node (r_1)  [right=of q_0] {$\ldots$};
   617 \node[state, initial]  (Q_0)  {$\mbox{}$};
       
   618 \node[state, initial]  (Q_01) [below=1mm of Q_0] {$\mbox{}$};  
       
   619 \node[state, initial]  (Q_02) [above=1mm of Q_0] {$\mbox{}$};  
       
   620 \node (r_1)  [right=of Q_0] {$\ldots$};
   247 \node[state]  (t_1)  [right=of r_1] {$\mbox{}$};
   621 \node[state]  (t_1)  [right=of r_1] {$\mbox{}$};
   248 \node[state]  (t_2)  [above=of t_1] {$\mbox{}$};
   622 \node[state]  (t_2)  [above=of t_1] {$\mbox{}$};
   249 \node[state]  (t_3)  [below=of t_1] {$\mbox{}$};
   623 \node[state]  (t_3)  [below=of t_1] {$\mbox{}$};
   250 \node[state]  (a_0)  [right=2.5cm of t_1] {$\mbox{}$};
   624 
   251 \node (b_1)  [right=of a_0] {$\ldots$};
   625 \node  (A_0)  [right=2.5cm of t_1] {$\mbox{}$};
       
   626 \node[state]  (A_01)  [above=1mm of A_0] {$\mbox{}$};
       
   627 \node[state]  (A_02)  [below=1mm of A_0] {$\mbox{}$};
       
   628 
       
   629 \node (b_1)  [right=of A_0] {$\ldots$};
   252 \node[state, accepting]  (c_1)  [right=of b_1] {$\mbox{}$};
   630 \node[state, accepting]  (c_1)  [right=of b_1] {$\mbox{}$};
   253 \node[state, accepting]  (c_2)  [above=of c_1] {$\mbox{}$};
   631 \node[state, accepting]  (c_2)  [above=of c_1] {$\mbox{}$};
   254 \node[state, accepting]  (c_3)  [below=of c_1] {$\mbox{}$};
   632 \node[state, accepting]  (c_3)  [below=of c_1] {$\mbox{}$};
   255 \path[->] (t_1) edge node [above, pos=0.3]  {$\epsilon$} (a_0);
   633 \path[->] (t_1) edge (A_01);
   256 \path[->] (t_2) edge node [above]  {$\epsilon$} (a_0);
   634 \path[->] (t_2) edge node [above]  {$\epsilon$s} (A_01);
   257 \path[->] (t_3) edge node [below]  {$\epsilon$} (a_0);
   635 \path[->] (t_3) edge (A_01);
   258 
   636 \path[->] (t_1) edge (A_02);
       
   637 \path[->] (t_2) edge (A_02);
       
   638 \path[->] (t_3) edge node [below]  {$\epsilon$s} (A_02);
       
   639  
   259 \begin{pgfonlayer}{background}
   640 \begin{pgfonlayer}{background}
   260 \node (3) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (q_0) (c_1) (c_2) (c_3)] {};
   641   \node (3) [rounded corners, inner sep=1mm, thick,
       
   642     draw=black!60, fill=black!20, fit= (Q_0) (c_1) (c_2) (c_3)] {};
   261 \node [yshift=2mm] at (3.north) {$r_1\cdot r_2$};
   643 \node [yshift=2mm] at (3.north) {$r_1\cdot r_2$};
   262 \end{pgfonlayer}
   644 \end{pgfonlayer}
   263 \end{tikzpicture}
   645 \end{tikzpicture}
   264 \end{center}
   646 \end{center}
   265 
   647 
   266 \noindent The case for the choice regular expression $r_1 +
   648 \noindent The idea behind this construction is that the start of any
   267 r_2$ is slightly different: We are given by recursion two
   649 string is first recognised by the first NFA, then we silently change
   268 automata representing $r_1$ and $r_2$ respectively. 
   650 to the second NFA; the ending of the string is recognised by the
   269 
   651 second NFA...just like matching of a string by the regular expression
   270 \begin{center}
   652 $r_1\cdot r_2$. The Scala code for this construction is given in
       
   653 Figure~\ref{thompson2} in Lines 16--23. The starting states of the
       
   654 $\epsilon$NFA are the starting states of the first NFA (corresponding
       
   655 to $r_1$); the accepting function is the accepting function of the
       
   656 second NFA (corresponding to $r_2$). The new transition function is
       
   657 all the ``old'' transitions plus the $\epsilon$-transitions connecting
       
   658 the accepting states of the first NFA to the starting states of the
       
   659 first NFA (Lines 18 and 19). The $\epsilon$NFA is then immediately
       
   660 translated in a NFA.
       
   661 
       
   662 
       
   663 The case for the alternative regular expression $r_1 + r_2$ is
       
   664 slightly different: We are given by recursion two NFAs representing
       
   665 $r_1$ and $r_2$ respectively. Each NFA has some starting states and
       
   666 some accepting states. We obtain a NFA for the regular expression $r_1
       
   667 + r_2$ by composing the transition functions (this crucially depends
       
   668 on knowing that the states of each component NFA are distinct---recall
       
   669 we implemented for this to hold some bespoke code for states). We also
       
   670 need to combine the starting states and accepting functions
       
   671 appropriately.
       
   672 
       
   673 \begin{center}
       
   674 \begin{tabular}[t]{ccc}
   271 \begin{tikzpicture}[node distance=3mm,
   675 \begin{tikzpicture}[node distance=3mm,
   272                              >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
   676     >=stealth',very thick,
       
   677     every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},
       
   678     baseline=(current bounding box.center)]
   273 \node at (0,0)  (1)  {$\mbox{}$};
   679 \node at (0,0)  (1)  {$\mbox{}$};
   274 \node[state, initial]  (2)  [above right=16mm of 1] {$\mbox{}$};
   680 \node  (2)  [above=10mm of 1] {};
   275 \node[state, initial]  (3)  [below right=16mm of 1] {$\mbox{}$};
   681 \node[state, initial]  (4)  [above=1mm of 2] {$\mbox{}$};
   276 
   682 \node[state, initial]  (5)  [below=1mm of 2] {$\mbox{}$};
   277 \node (a)  [right=of 2] {$\ldots$};
   683 \node[state, initial]  (3)  [below=10mm of 1] {$\mbox{}$};
   278 \node[state, accepting]  (a1)  [right=of a] {$\mbox{}$};
   684 
       
   685 \node (a)  [right=of 2] {$\ldots\,$};
       
   686 \node  (a1)  [right=of a] {$$};
   279 \node[state, accepting]  (a2)  [above=of a1] {$\mbox{}$};
   687 \node[state, accepting]  (a2)  [above=of a1] {$\mbox{}$};
   280 \node[state, accepting]  (a3)  [below=of a1] {$\mbox{}$};
   688 \node[state, accepting]  (a3)  [below=of a1] {$\mbox{}$};
   281 
   689 
   282 \node (b)  [right=of 3] {$\ldots$};
   690 \node (b)  [right=of 3] {$\ldots$};
   283 \node[state, accepting]  (b1)  [right=of b] {$\mbox{}$};
   691 \node[state, accepting]  (b1)  [right=of b] {$\mbox{}$};
   288 \node (2) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (3) (b1) (b2) (b3)] {};
   696 \node (2) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (3) (b1) (b2) (b3)] {};
   289 \node [yshift=3mm] at (1.north) {$r_1$};
   697 \node [yshift=3mm] at (1.north) {$r_1$};
   290 \node [yshift=3mm] at (2.north) {$r_2$};
   698 \node [yshift=3mm] at (2.north) {$r_2$};
   291 \end{pgfonlayer}
   699 \end{pgfonlayer}
   292 \end{tikzpicture}
   700 \end{tikzpicture}
   293 \end{center}
   701 &
   294 
   702 \mbox{}\qquad\tikz{\draw[>=stealth,line width=2mm,->] (0,0) -- (1, 0)}\quad\mbox{}
   295 \noindent Each automaton has a single start state and
   703 &
   296 potentially several accepting states. We obtain a NFA for the
   704 \begin{tikzpicture}[node distance=3mm,
   297 regular expression $r_1 + r_2$ by introducing a new starting
   705     >=stealth',very thick,
   298 state and connecting it with an $\epsilon$-transition to the
   706     every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},
   299 two starting states above, like so
   707     baseline=(current bounding box.center)]
   300 
   708 \node at (0,0) (1)  {$\mbox{}$};
   301 \begin{center}
   709 \node (2)  [above=10mm of 1] {$$};
   302 \hspace{2cm}\begin{tikzpicture}[node distance=3mm,
   710 \node[state, initial]  (4)  [above=1mm of 2] {$\mbox{}$};
   303                              >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
   711 \node[state, initial]  (5)  [below=1mm of 2] {$\mbox{}$};
   304 \node at (0,0) [state, initial]  (1)  {$\mbox{}$};
   712 \node[state, initial]  (3)  [below=10mm of 1] {$\mbox{}$};
   305 \node[state]  (2)  [above right=16mm of 1] {$\mbox{}$};
   713 
   306 \node[state]  (3)  [below right=16mm of 1] {$\mbox{}$};
   714 \node (a)  [right=of 2] {$\ldots\,$};
   307 
   715 \node (a1)  [right=of a] {$$};
   308 \node (a)  [right=of 2] {$\ldots$};
       
   309 \node[state, accepting]  (a1)  [right=of a] {$\mbox{}$};
       
   310 \node[state, accepting]  (a2)  [above=of a1] {$\mbox{}$};
   716 \node[state, accepting]  (a2)  [above=of a1] {$\mbox{}$};
   311 \node[state, accepting]  (a3)  [below=of a1] {$\mbox{}$};
   717 \node[state, accepting]  (a3)  [below=of a1] {$\mbox{}$};
   312 
   718 
   313 \node (b)  [right=of 3] {$\ldots$};
   719 \node (b)  [right=of 3] {$\ldots$};
   314 \node[state, accepting]  (b1)  [right=of b] {$\mbox{}$};
   720 \node[state, accepting]  (b1)  [right=of b] {$\mbox{}$};
   315 \node[state, accepting]  (b2)  [above=of b1] {$\mbox{}$};
   721 \node[state, accepting]  (b2)  [above=of b1] {$\mbox{}$};
   316 \node[state, accepting]  (b3)  [below=of b1] {$\mbox{}$};
   722 \node[state, accepting]  (b3)  [below=of b1] {$\mbox{}$};
   317 
   723 
   318 \path[->] (1) edge node [above]  {$\epsilon$} (2);
   724 %\path[->] (1) edge node [above]  {$\epsilon$} (2);
   319 \path[->] (1) edge node [below]  {$\epsilon$} (3);
   725 %\path[->] (1) edge node [below]  {$\epsilon$} (3);
   320 
   726 
   321 \begin{pgfonlayer}{background}
   727 \begin{pgfonlayer}{background}
   322 \node (3) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (1) (a2) (a3) (b2) (b3)] {};
   728 \node (3) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (1) (a2) (a3) (b2) (b3)] {};
   323 \node [yshift=3mm] at (3.north) {$r_1+ r_2$};
   729 \node [yshift=3mm] at (3.north) {$r_1+ r_2$};
   324 \end{pgfonlayer}
   730 \end{pgfonlayer}
   325 \end{tikzpicture}
   731 \end{tikzpicture}
   326 \end{center}
   732 \end{tabular}
   327 
   733 \end{center}
   328 \noindent 
   734 
   329 Finally for the $*$-case we have an automaton for $r$
   735 \noindent The code for this construction is in Figure~\ref{thompson2}
   330 
   736 in Lines 25--33.
   331 \begin{center}
   737 
       
   738 Finally for the $*$-case we have a NFA for $r$ and connect its
       
   739 accepting states to a new starting state via
       
   740 $\epsilon$-transitions. This new starting state is also an accepting
       
   741 state, because $r^*$ can recognise the empty string.
       
   742 
       
   743 \begin{center}
       
   744 \begin{tabular}[b]{@{}ccc@{}}  
   332 \begin{tikzpicture}[node distance=3mm,
   745 \begin{tikzpicture}[node distance=3mm,
   333                              >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
   746     >=stealth',very thick,
   334 \node at (0,0)  (1)  {$\mbox{}$};
   747     every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},
   335 \node[state, initial]  (2)  [right=16mm of 1] {$\mbox{}$};
   748     baseline=(current bounding box.north)]
       
   749 \node (2)  {$\mbox{}$};
       
   750 \node[state, initial]  (4)  [above=1mm of 2] {$\mbox{}$};
       
   751 \node[state, initial]  (5)  [below=1mm of 2] {$\mbox{}$};  
   336 \node (a)  [right=of 2] {$\ldots$};
   752 \node (a)  [right=of 2] {$\ldots$};
   337 \node[state, accepting]  (a1)  [right=of a] {$\mbox{}$};
   753 \node[state, accepting]  (a1)  [right=of a] {$\mbox{}$};
   338 \node[state, accepting]  (a2)  [above=of a1] {$\mbox{}$};
   754 \node[state, accepting]  (a2)  [above=of a1] {$\mbox{}$};
   339 \node[state, accepting]  (a3)  [below=of a1] {$\mbox{}$};
   755 \node[state, accepting]  (a3)  [below=of a1] {$\mbox{}$};
   340 \begin{pgfonlayer}{background}
   756 \begin{pgfonlayer}{background}
   341 \node (1) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (2) (a1) (a2) (a3)] {};
   757 \node (1) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (2) (a1) (a2) (a3)] {};
   342 \node [yshift=3mm] at (1.north) {$r$};
   758 \node [yshift=3mm] at (1.north) {$r$};
   343 \end{pgfonlayer}
   759 \end{pgfonlayer}
   344 \end{tikzpicture}
   760 \end{tikzpicture}
   345 \end{center}
   761 &
   346 
   762 \raisebox{-16mm}{\;\tikz{\draw[>=stealth,line width=2mm,->] (0,0) -- (1, 0)}}
   347 \noindent and connect its accepting states to a new starting
   763 &
   348 state via $\epsilon$-transitions. This new starting state is
       
   349 also an accepting state, because $r^*$ can recognise the
       
   350 empty string. This gives the following automaton for $r^*$:
       
   351 
       
   352 \begin{center}
       
   353 \begin{tikzpicture}[node distance=3mm,
   764 \begin{tikzpicture}[node distance=3mm,
   354                              >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
   765     >=stealth',very thick,
       
   766     every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},
       
   767     baseline=(current bounding box.north)]
   355 \node at (0,0) [state, initial,accepting]  (1)  {$\mbox{}$};
   768 \node at (0,0) [state, initial,accepting]  (1)  {$\mbox{}$};
   356 \node[state]  (2)  [right=16mm of 1] {$\mbox{}$};
   769 \node (2)  [right=16mm of 1] {$\mbox{}$};
       
   770 \node[state]  (4)  [above=1mm of 2] {$\mbox{}$};
       
   771 \node[state]  (5)  [below=1mm of 2] {$\mbox{}$};  
   357 \node (a)  [right=of 2] {$\ldots$};
   772 \node (a)  [right=of 2] {$\ldots$};
   358 \node[state]  (a1)  [right=of a] {$\mbox{}$};
   773 \node[state]  (a1)  [right=of a] {$\mbox{}$};
   359 \node[state]  (a2)  [above=of a1] {$\mbox{}$};
   774 \node[state]  (a2)  [above=of a1] {$\mbox{}$};
   360 \node[state]  (a3)  [below=of a1] {$\mbox{}$};
   775 \node[state]  (a3)  [below=of a1] {$\mbox{}$};
   361 \path[->] (1) edge node [above]  {$\epsilon$} (2);
   776 \path[->] (1) edge node [below]  {$\epsilon$} (4);
   362 \path[->] (a1) edge [bend left=45] node [above]  {$\epsilon$} (1);
   777 \path[->] (1) edge node [below]  {$\epsilon$} (5);
       
   778 \path[->] (a1) edge [bend left=45] node [below]  {$\epsilon$} (1);
   363 \path[->] (a2) edge [bend right] node [below]  {$\epsilon$} (1);
   779 \path[->] (a2) edge [bend right] node [below]  {$\epsilon$} (1);
   364 \path[->] (a3) edge [bend left=45] node [below]  {$\epsilon$} (1);
   780 \path[->] (a3) edge [bend left=45] node [below]  {$\epsilon$} (1);
   365 \begin{pgfonlayer}{background}
   781 \begin{pgfonlayer}{background}
   366 \node (2) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (1) (a2) (a3)] {};
   782 \node (2) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (1) (a2) (a3)] {};
   367 \node [yshift=3mm] at (2.north) {$r^*$};
   783 \node [yshift=3mm] at (2.north) {$r^*$};
   368 \end{pgfonlayer}
   784 \end{pgfonlayer}
   369 \end{tikzpicture}
   785 \end{tikzpicture}    
   370 \end{center}
   786 \end{tabular}
   371 
   787 \end{center}
   372 \noindent This construction of a NFA from a regular expression
   788 
   373 was invented by Ken Thompson in 1968.
   789 \noindent 
   374 
   790 The corresponding code is in Figure~\ref{thompson2} in Lines 35--43)
   375 
   791 
   376 \subsubsection*{Subset Construction}
   792 To sum up, you can see in the sequence and star cases the need of
   377 
   793 having silent $\epsilon$-transitions. Similarly the alternative case
   378 What is interesting is that for every NFA we can find a DFA
   794 shows the need of the NFA-nondeterminism. It seems awkward to form the
   379 which recognises the same language. This can, for example, be
   795 `alternative' composition of two DFAs, because DFA do not allow
   380 done by the \emph{subset construction}. Consider again the NFA
   796 several starting and successor states. All these constructions can now
   381 below on the left, consisting of nodes labeled $0$, $1$ and $2$. 
   797 be put together in the following recursive function:
       
   798 
       
   799 
       
   800 {\small\begin{lstlisting}[language=Scala]
       
   801 def thompson(r: Rexp) : NFAt = r match {
       
   802   case ZERO => NFA_ZERO()
       
   803   case ONE => NFA_ONE()
       
   804   case CHAR(c) => NFA_CHAR(c)  
       
   805   case ALT(r1, r2) => NFA_ALT(thompson(r1), thompson(r2))
       
   806   case SEQ(r1, r2) => NFA_SEQ(thompson(r1), thompson(r2))
       
   807   case STAR(r1) => NFA_STAR(thompson(r1))
       
   808 }
       
   809 \end{lstlisting}}
       
   810 
       
   811 \noindent
       
   812 It calculates a NFA from a regular expressions. At last we can run
       
   813 NFAs for the our evil regular expression examples.  The graph on the
       
   814 left shows that when translating a regular expression such as
       
   815 $a^{\{n\}}$ into a NFA, the size can blow up and then even the
       
   816 relative fast (on small examples) breadth-first search can be
       
   817 slow. The graph on the right shows that with `evil' regular
       
   818 expressions the depth-first search can be abysmally slow. Even if the
       
   819 graphs not completely overlap with the curves of Python, Ruby and
       
   820 Java, they are similar enough. OK\ldots now you know why regular
       
   821 expression matchers in those languages are so slow. 
       
   822 
       
   823 
       
   824 \begin{center}
       
   825 \begin{tabular}{@{\hspace{-1mm}}c@{\hspace{1mm}}c@{}}  
       
   826 \begin{tikzpicture}
       
   827 \begin{axis}[
       
   828     title={Graph: $a^{?\{n\}} \cdot a^{\{n\}}$ and strings 
       
   829            $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
       
   830     title style={yshift=-2ex},
       
   831     xlabel={$n$},
       
   832     x label style={at={(1.05,0.0)}},
       
   833     ylabel={time in secs},
       
   834     enlargelimits=false,
       
   835     xtick={0,5,...,30},
       
   836     xmax=33,
       
   837     ymax=35,
       
   838     ytick={0,5,...,30},
       
   839     scaled ticks=false,
       
   840     axis lines=left,
       
   841     width=5.5cm,
       
   842     height=4cm, 
       
   843     legend entries={Python,Ruby, breadth-first NFA},
       
   844     legend style={at={(0.5,-0.25)},anchor=north,font=\small},
       
   845     legend cell align=left]
       
   846   \addplot[blue,mark=*, mark options={fill=white}] table {re-python.data};
       
   847   \addplot[brown,mark=triangle*, mark options={fill=white}] table {re-ruby.data};
       
   848   % breath-first search in NFAs
       
   849   \addplot[red,mark=*, mark options={fill=white}] table {
       
   850     1 0.00586
       
   851     2 0.01209
       
   852     3 0.03076
       
   853     4 0.08269
       
   854     5 0.12881
       
   855     6 0.25146
       
   856     7 0.51377
       
   857     8 0.89079
       
   858     9 1.62802
       
   859     10 3.05326
       
   860     11 5.92437
       
   861     12 11.67863
       
   862     13 24.00568
       
   863   };
       
   864 \end{axis}
       
   865 \end{tikzpicture}
       
   866 &
       
   867 \begin{tikzpicture}
       
   868 \begin{axis}[
       
   869     title={Graph: $(a^*)^* \cdot b$ and strings 
       
   870            $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
       
   871     title style={yshift=-2ex},
       
   872     xlabel={$n$},
       
   873     x label style={at={(1.05,0.0)}},
       
   874     ylabel={time in secs},
       
   875     enlargelimits=false,
       
   876     xtick={0,5,...,30},
       
   877     xmax=33,
       
   878     ymax=35,
       
   879     ytick={0,5,...,30},
       
   880     scaled ticks=false,
       
   881     axis lines=left,
       
   882     width=5.5cm,
       
   883     height=4cm, 
       
   884     legend entries={Python, Java, depth-first NFA},
       
   885     legend style={at={(0.5,-0.25)},anchor=north,font=\small},
       
   886     legend cell align=left]
       
   887   \addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
       
   888   \addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};  
       
   889   % depth-first search in NFAs
       
   890   \addplot[red,mark=*, mark options={fill=white}] table {
       
   891     1 0.00605
       
   892     2 0.03086
       
   893     3 0.11994
       
   894     4 0.45389
       
   895     5 2.06192
       
   896     6 8.04894
       
   897     7 32.63549
       
   898   };
       
   899 \end{axis}
       
   900 \end{tikzpicture}
       
   901 \end{tabular}
       
   902 \end{center}
       
   903 
       
   904 
       
   905 
       
   906 \subsection*{Subset Construction}
       
   907 
       
   908 Of course, some developers of regular expression matchers are aware of
       
   909 these problems with sluggish NFAs and try to address them. One common
       
   910 technique for alleviating the problem I like to show you in this
       
   911 section. This will also explain why we insisted on polymorphic types in
       
   912 our DFA code (remember I used \texttt{A} and \texttt{C} for the types
       
   913 of states and the input, see Figure~\ref{dfa} on
       
   914 Page~\pageref{dfa}).\bigskip
       
   915 
       
   916 \noindent
       
   917 To start remember that we did not bother with defining and
       
   918 implementing $\epsilon$NFAs: we immediately translated them into
       
   919 equivalent NFAs. Equivalent in the sense of accepting the same
       
   920 language (though we only claimed this and did not prove it
       
   921 rigorously). Remember also that NFAs have non-deterministic
       
   922 transitions defined as a relation or implemented as function returning
       
   923 sets of states.  This non-determinism is crucial for the Thompson
       
   924 Construction to work (recall the cases for $\cdot$, $+$ and
       
   925 ${}^*$). But this non-determinism makes it harder with NFAs to decide
       
   926 when a string is accepted or not; whereas such a decision is rather
       
   927 straightforward with DFAs: recall their transition function is a
       
   928 \emph{function} that returns a single state. So with DFAs we do not
       
   929 have to search at all.  What is perhaps interesting is the fact that
       
   930 for every NFA we can find a DFA that also recognises the same
       
   931 language. This might sound a bit paradoxical: NFA $\rightarrow$
       
   932 decision of acceptance hard; DFA $\rightarrow$ decision easy. But this
       
   933 \emph{is} true\ldots but of course there is always a caveat---nothing
       
   934 ever is for free in life.
       
   935 
       
   936 There are actually a number of methods for transforming a NFA into
       
   937 an equivalent DFA, but the most famous one is the \emph{subset
       
   938   construction}. Consider the following NFA where the states are
       
   939 labelled with $0$, $1$ and $2$.
   382 
   940 
   383 \begin{center}
   941 \begin{center}
   384 \begin{tabular}{c@{\hspace{10mm}}c}
   942 \begin{tabular}{c@{\hspace{10mm}}c}
   385 \begin{tikzpicture}[scale=0.7,>=stealth',very thick,
   943 \begin{tikzpicture}[scale=0.7,>=stealth',very thick,
   386                     every state/.style={minimum size=0pt,
   944                     every state/.style={minimum size=0pt,
   387                     draw=blue!50,very thick,fill=blue!20},
   945                     draw=blue!50,very thick,fill=blue!20},
   388                     baseline=0mm]
   946                     baseline=(current bounding box.center)]
   389 \node[state,initial]  (q_0)  {$0$};
   947 \node[state,initial]  (Q_0)  {$0$};
   390 \node[state] (q_1) [above=of q_0] {$1$};
   948 \node[state] (Q_1) [below=of Q_0] {$1$};
   391 \node[state, accepting] (q_2) [below=of q_0] {$2$};
   949 \node[state, accepting] (Q_2) [below=of Q_1] {$2$};
   392 \path[->] (q_0) edge node [left]  {$\epsilon$} (q_1);
   950 
   393 \path[->] (q_0) edge node [left]  {$\epsilon$} (q_2);
   951 \path[->] (Q_0) edge node [right]  {$b$} (Q_1);
   394 \path[->] (q_0) edge [loop right] node  {$a$} ();
   952 \path[->] (Q_1) edge node [right]  {$a,b$} (Q_2);
   395 \path[->] (q_1) edge [loop above] node  {$a$} ();
   953 \path[->] (Q_0) edge [loop above] node  {$a, b$} ();
   396 \path[->] (q_2) edge [loop below] node  {$b$} ();
       
   397 \end{tikzpicture}
   954 \end{tikzpicture}
   398 &
   955 &
   399 \begin{tabular}{r|cl}
   956 \begin{tabular}{r|ll}
   400 nodes & $a$ & $b$\\
   957 states & $a$ & $b$\\
   401 \hline
   958 \hline
   402 $\{\}\phantom{\star}$ & $\{\}$ & $\{\}$\\
   959 $\{\}\phantom{\star}$ & $\{\}$ & $\{\}$\\
   403 $\{0\}\phantom{\star}$       & $\{0,1,2\}$   & $\{2\}$\\
   960 start: $\{0\}\phantom{\star}$       & $\{0\}$   & $\{0,1\}$\\
   404 $\{1\}\phantom{\star}$       & $\{1\}$       & $\{\}$\\
   961 $\{1\}\phantom{\star}$       & $\{2\}$       & $\{2\}$\\
   405 $\{2\}\star$  & $\{\}$ & $\{2\}$\\
   962 $\{2\}\star$  & $\{\}$ & $\{\}$\\
   406 $\{0,1\}\phantom{\star}$     & $\{0,1,2\}$   & $\{2\}$\\
   963 $\{0,1\}\phantom{\star}$     & $\{0,2\}$   & $\{0,1,2\}$\\
   407 $\{0,2\}\star$ & $\{0,1,2\}$   & $\{2\}$\\
   964 $\{0,2\}\star$ & $\{0\}$   & $\{0,1\}$\\
   408 $\{1,2\}\star$ & $\{1\}$       & $\{2\}$\\
   965 $\{1,2\}\star$ & $\{2\}$       & $\{2\}$\\
   409 s: $\{0,1,2\}\star$ & $\{0,1,2\}$ & $\{2\}$\\
   966 $\{0,1,2\}\star$ & $\{0,2\}$ & $\{0,1,2\}$\\
   410 \end{tabular}
   967 \end{tabular}
   411 \end{tabular}
   968 \end{tabular}
   412 \end{center}
   969 \end{center}
   413 
   970 
   414 \noindent The nodes of the DFA are given by calculating all
   971 \noindent The states of the corresponding DFA are given by generating
   415 subsets of the set of nodes of the NFA (seen in the nodes
   972 all subsets of the set $\{0,1,2\}$ (seen in the states column
   416 column on the right). The table shows the transition function
   973 in the table on the right). The other columns define the transition
   417 for the DFA. The first row states that $\{\}$ is the
   974 function for the DFA for inputs $a$ and $b$. The first row states that
   418 sink node which has transitions for $a$ and $b$ to itself.
   975 $\{\}$ is the sink state which has transitions for $a$ and $b$ to
   419 The next three lines are calculated as follows: 
   976 itself.  The next three lines are calculated as follows:
   420 
   977 
   421 \begin{itemize}
   978 \begin{itemize}
   422 \item suppose you calculate the entry for the transition for
   979 \item Suppose you calculate the entry for the $a$-transition for state
   423       $a$ and the node $\{0\}$
   980   $\{0\}$. Look for all states in the NFA that can be reached by such
   424 \item start from the node $0$ in the NFA
   981   a transition from this state; this is only state $0$; therefore from
   425 \item do as many $\epsilon$-transition as you can obtaining a
   982   state $\{0\}$ we can go to state $\{0\}$ via an $a$-transition.
   426       set of nodes, in this case $\{0,1,2\}$
   983 \item Do the same for the $b$-transition; you can reach states $0$ and
   427 \item filter out all notes that do not allow an $a$-transition
   984   $1$ in the NFA; therefore in the DFA we can go from state $\{0\}$ to
   428       from this set, this excludes $2$ which does not permit a
   985   state $\{0,1\}$ via an $b$-transition.
   429       $a$-transition
   986 \item Continue with the states $\{1\}$ and $\{2\}$.
   430 \item from the remaining set, do as many $\epsilon$-transition
       
   431       as you can, this yields again $\{0,1,2\}$     
       
   432 \item the resulting set specifies the transition from $\{0\}$
       
   433       when given an $a$ 
       
   434 \end{itemize}
   987 \end{itemize}
   435 
   988 
   436 \noindent So the transition from the state $\{0\}$ reading an
   989 \noindent
   437 $a$ goes to the state $\{0,1,2\}$. Similarly for the other
   990 Once you filled in the transitions for `simple' states $\{0\}$
   438 entries in the rows for $\{0\}$, $\{1\}$ and $\{2\}$. The
   991 .. $\{2\}$, you only have to build the union for the compound states
   439 other rows are calculated by just taking the union of the
   992 $\{0,1\}$, $\{0,2\}$ and so on. For example for $\{0,1\}$ you take the
   440 single node entries. For example for $a$ and $\{0,1\}$ we need
   993 union of Line $\{0\}$ and Line $\{1\}$, which gives $\{0,2\}$ for $a$,
   441 to take the union of $\{0,1,2\}$ (for $0$) and $\{1\}$ (for
   994 and $\{0,1,2\}$ for $b$. And so on.
   442 $1$). The starting state of the DFA can be calculated from the
   995 
   443 starting state of the NFA, that is $0$, and then do as many
   996 The starting state of the DFA can be calculated from the starting
   444 $\epsilon$-transitions as possible. This gives $\{0,1,2\}$
   997 states of the NFA, that is in this case $\{0\}$. But in general there
   445 which is the starting state of the DFA. The terminal states in
   998 can of course be many starting states in the NFA and you would take
   446 the DFA are given by all sets that contain a $2$, which is the
   999 the corresponding subset as \emph{the} starting state of the DFA.
   447 terminal state of the NFA. This completes the subset
  1000 
   448 construction. So the corresponding DFA to the NFA from 
  1001 The accepting states in the DFA are given by all sets that contain a
   449 above is
  1002 $2$, which is the only accpting state in this NFA. But again in
   450 
  1003 general if the subset contains any accepting state from the NFA, then
   451 \begin{center}
  1004 the corresponding state in the DFA is accepting as well.  This
   452 \begin{tikzpicture}[scale=0.7,>=stealth',very thick,
  1005 completes the subset construction. The corresponding DFA for the NFA
       
  1006 shown above is:
       
  1007 
       
  1008 \begin{equation}
       
  1009 \begin{tikzpicture}[scale=0.8,>=stealth',very thick,
   453                     every state/.style={minimum size=0pt,
  1010                     every state/.style={minimum size=0pt,
   454                     draw=blue!50,very thick,fill=blue!20},
  1011                       draw=blue!50,very thick,fill=blue!20},
   455                     baseline=0mm]
  1012                     baseline=(current bounding box.center)]
   456 \node[state,initial,accepting]  (q012)  {$0,1,2$};
  1013 \node[state,initial]  (q0)  {$0$};
   457 \node[state,accepting] (q02) [right=of q012] {$0,2$};
  1014 \node[state] (q01) [right=of q0] {$0,1$};
   458 \node[state] (q01) [above=of q02] {$0,1$};
  1015 \node[state,accepting] (q02) [below=of q01] {$0,2$};
   459 \node[state,accepting] (q12) [below=of q02] {$1,2$};
  1016 \node[state,accepting] (q012) [right=of q02] {$0,1,2$};
   460 \node[state] (q0) [right=2cm of q01] {$0$};
  1017 \node[state] (q1) [below=0.5cm of q0] {$1$};
   461 \node[state] (q1) [right=2.5cm of q02] {$1$};
  1018 \node[state,accepting] (q2) [below=1cm of q1] {$2$};
   462 \node[state,accepting] (q2) [right=1.5cm of q12] {$2$};
  1019 \node[state] (qn) [below left=1cm of q2] {$\{\}$};
   463 \node[state] (qn) [right=of q1] {$\{\}$};
  1020 \node[state,accepting] (q12) [below right=1cm of q2] {$1,2$};
   464 
  1021 
   465 \path[->] (q012) edge [loop below] node  {$a$} ();
  1022 \path[->] (q0) edge node [above] {$b$} (q01);
   466 \path[->] (q012) edge node [above]  {$b$} (q2);
  1023 \path[->] (q01) edge node [above] {$b$} (q012);
   467 \path[->] (q12) edge [bend left] node [below,pos=0.4]  {$a$} (q1);
  1024 \path[->] (q0) edge [loop above] node  {$a$} ();
   468 \path[->] (q12) edge node [below]  {$b$} (q2);
  1025 \path[->] (q012) edge [loop right] node  {$b$} ();
   469 \path[->] (q02) edge node [above]  {$a$} (q012);
  1026 \path[->] (q012) edge node [below] {$a$} (q02);
   470 \path[->] (q02) edge [bend left] node [above, pos=0.8]  {$b$} (q2);
  1027 \path[->] (q02) edge node [below] {$a$} (q0);
   471 \path[->] (q01) edge node [below]  {$a$} (q012);
  1028 \path[->] (q01) edge [bend left] node [left]  {$a$} (q02);
   472 \path[->] (q01) edge [bend left] node [above]  {$b$} (q2);
  1029 \path[->] (q02) edge [bend left] node [right]  {$b$} (q01);
   473 \path[->] (q0) edge node [below]  {$a$} (q012);
  1030 \path[->] (q1) edge node [left] {$a,b$} (q2);
   474 \path[->] (q0) edge node [right, pos=0.2]  {$b$} (q2);
  1031 \path[->] (q12) edge node [right] {$a, b$} (q2);
   475 \path[->] (q1) edge [loop above] node  {$a$} ();
  1032 \path[->] (q2) edge node [right] {$a, b$} (qn);
   476 \path[->] (q1) edge node [above]  {$b$} (qn);
  1033 \path[->] (qn) edge [loop left] node  {$a,b$} ();
   477 \path[->] (q2) edge [loop right] node  {$b$} ();
  1034 \end{tikzpicture}\label{subsetdfa}
   478 \path[->] (q2) edge node [below]  {$a$} (qn);
  1035 \end{equation}
   479 \path[->] (qn) edge [loop above] node  {$a,b$} ();
  1036 
   480 \end{tikzpicture}
  1037 \noindent
   481 \end{center}
  1038 Please check that this is indeed a DFA. The big question is whether
   482 
  1039 this DFA can recognise the same language as the NFA we started with?
   483 
       
   484 
       
   485 There are two points to note: One is that very often the
       
   486 resulting DFA contains a number of ``dead'' nodes that are
       
   487 never reachable from the starting state. For example
       
   488 there is no way to reach node $\{0,2\}$ from the starting
       
   489 state $\{0,1,2\}$. I let you find the other dead states.
       
   490 In effect the DFA in this example is not a minimal DFA. Such
       
   491 dead nodes can be safely removed without changing the language
       
   492 that is recognised by the DFA. Another point is that in some
       
   493 cases, however, the subset construction produces a DFA that
       
   494 does \emph{not} contain any dead nodes\ldots{}that means it
       
   495 calculates a minimal DFA. Which in turn means that in some
       
   496 cases the number of nodes by going from NFAs to DFAs
       
   497 exponentially increases, namely by $2^n$ (which is the number
       
   498 of subsets you can form for $n$ nodes). 
       
   499 
       
   500 Removing all the dead states in the automaton above,
       
   501 gives a much more legible automaton, namely
       
   502 
       
   503 \begin{center}
       
   504 \begin{tikzpicture}[scale=0.7,>=stealth',very thick,
       
   505                     every state/.style={minimum size=0pt,
       
   506                     draw=blue!50,very thick,fill=blue!20},
       
   507                     baseline=0mm]
       
   508 \node[state,initial,accepting]  (q012)  {$0,1,2$};
       
   509 \node[state,accepting] (q2) [right=of q012] {$2$};
       
   510 \node[state] (qn) [right=of q2] {$\{\}$};
       
   511 
       
   512 \path[->] (q012) edge [loop below] node  {$a$} ();
       
   513 \path[->] (q012) edge node [above]  {$b$} (q2);
       
   514 \path[->] (q2) edge [loop below] node  {$b$} ();
       
   515 \path[->] (q2) edge node [below]  {$a$} (qn);
       
   516 \path[->] (qn) edge [loop above] node  {$a,b$} ();
       
   517 \end{tikzpicture}
       
   518 \end{center}
       
   519 
       
   520 \noindent Now the big question is whether this DFA
       
   521 can recognise the same language as the NFA we started with.
       
   522 I let you ponder about this question.
  1040 I let you ponder about this question.
   523 
  1041 
   524 \subsubsection*{Brzozowski's Method}
  1042 
   525 
  1043 There are also two points to note: One is that very often in the
   526 As said before, we can also go into the other direction---from
  1044 subset construction the resulting DFA contains a number of ``dead''
   527 DFAs to regular expressions. Brzozowski's method calculates
  1045 states that are never reachable from the starting state. This is
   528 a regular expression using familiar transformations for
  1046 obvious in the example, where state $\{1\}$, $\{2\}$, $\{1,2\}$ and
   529 solving equational systems. Consider the DFA:
  1047 $\{\}$ can never be reached from the starting state. But this might
   530 
  1048 not always be as obvious as that. In effect the DFA in this example is
   531 \begin{center}
  1049 not a \emph{minimal} DFA (more about this in a minute). Such dead
   532 \begin{tikzpicture}[scale=1.5,>=stealth',very thick,auto,
  1050 states can be safely removed without changing the language that is
   533                     every state/.style={minimum size=0pt,
  1051 recognised by the DFA. Another point is that in some cases, however,
   534                     inner sep=2pt,draw=blue!50,very thick,
  1052 the subset construction produces a DFA that does \emph{not} contain
   535                     fill=blue!20}]
  1053 any dead states\ldots{}this means it calculates a minimal DFA. Which
   536   \node[state, initial]        (q0) at ( 0,1) {$q_0$};
  1054 in turn means that in some cases the number of states can by going
   537   \node[state]                    (q1) at ( 1,1) {$q_1$};
  1055 from NFAs to DFAs exponentially increase, namely by $2^n$ (which is
   538   \node[state, accepting] (q2) at ( 2,1) {$q_2$};
  1056 the number of subsets you can form for sets of $n$ states).  This blow
   539   \path[->] (q0) edge[bend left] node[above] {$a$} (q1)
  1057 up in the number of states in the DFA is again bad news for how
   540             (q1) edge[bend left] node[above] {$b$} (q0)
  1058 quickly you can decide whether a string is accepted by a DFA or
   541             (q2) edge[bend left=50] node[below] {$b$} (q0)
  1059 not. So the caveat with DFAs is that they might make the task of
   542             (q1) edge node[above] {$a$} (q2)
  1060 finding the next state trival, but might require $2^n$ times as many
   543             (q2) edge [loop right] node {$a$} ()
  1061 states then a NFA.\bigskip
   544             (q0) edge [loop below] node {$b$} ();
  1062 
   545 \end{tikzpicture}
  1063 \noindent
   546 \end{center}
  1064 To conclude this section, how conveniently we can
   547 
  1065 implement the subset construction with our versions of NFAs and
   548 \noindent for which we can set up the following equational
  1066 DFAs? Very conveninetly. The code is just:
   549 system
  1067 
   550 
  1068 {\small\begin{lstlisting}[language=Scala]
   551 \begin{eqnarray}
  1069 def subset[A, C](nfa: NFA[A, C]) : DFA[Set[A], C] = {
   552 q_0 & = & \ONE + q_0\,b + q_1\,b +  q_2\,b\\
  1070   DFA(nfa.starts, 
   553 q_1 & = & q_0\,a\\
  1071       { case (qs, c) => nfa.nexts(qs, c) }, 
   554 q_2 & = & q_1\,a + q_2\,a
  1072       _.exists(nfa.fins))
   555 \end{eqnarray}
  1073 }
   556 
  1074 \end{lstlisting}}  
   557 \noindent There is an equation for each node in the DFA. Let
  1075 
   558 us have a look how the right-hand sides of the equations are
  1076 \noindent
   559 constructed. First have a look at the second equation: the
  1077 The interesting point in this code is that the state type of the
   560 left-hand side is $q_1$ and the right-hand side $q_0\,a$. The
  1078 calculated DFA is \texttt{Set[A]}. Think carefully that this works out
   561 right-hand side is essentially all possible ways how to end up
  1079 correctly.
   562 in node $q_1$. There is only one incoming edge from $q_0$ consuming
  1080 
   563 an $a$.  Therefore the right hand side is this
  1081 The DFA is then given by three components: the starting states, the
   564 state followed by character---in this case $q_0\,a$. Now lets
  1082 transition function and the accepting-states function.  The starting
   565 have a look at the third equation: there are two incoming
  1083 states are a set in the given NFA, but a single state in the DFA.  The
   566 edges for $q_2$. Therefore we have two terms, namely $q_1\,a$ and
  1084 transition function, given the state \texttt{qs} and input \texttt{c},
   567 $q_2\,a$. These terms are separated by $+$. The first states
  1085 needs to produce the next state: this is the set of all NFA states
   568 that if in state $q_1$ consuming an $a$ will bring you to
  1086 that are reachable from each state in \texttt{qs}. The function
   569 $q_2$, and the secont that being in $q_2$ and consuming an $a$
  1087 \texttt{nexts} from the NFA class already calculates this for us. The
   570 will make you stay in $q_2$. The right-hand side of the
  1088 accepting-states function for the DFA is true henevner at least one
   571 first equation is constructed similarly: there are three 
  1089 state in the subset is accepting (that is true) in the NFA.\medskip
   572 incoming edges, therefore there are three terms. There is
  1090 
   573 one exception in that we also ``add'' $\ONE$ to the
  1091 \noindent
   574 first equation, because it corresponds to the starting state
  1092 You might be able to spend some quality time tinkering with this code
   575 in the DFA.
  1093 and time to ponder about it. Then you will probably notice that it is
   576 
  1094 actually a bit silly. The whole point of translating the NFA into a
   577 Having constructed the equational system, the question is
  1095 DFA via the subset construction is to make the decision of whether a
   578 how to solve it? Remarkably the rules are very similar to
  1096 string is accepted or not faster. Given the code above, the generated
   579 solving usual linear equational systems. For example the
  1097 DFA will be exactly as fast, or as slow, as the NFA we started with
   580 second equation does not contain the variable $q_1$ on the
  1098 (actually it will even be a tiny bit slower). The reason is that we
   581 right-hand side of the equation. We can therefore eliminate 
  1099 just re-use the \texttt{nexts} function from the NFA. This function
   582 $q_1$ from the system by just substituting this equation
  1100 implements the non-deterministic breadth-first search.  You might be
   583 into the other two. This gives
  1101 thinking: This is cheating! \ldots{} Well, not quite as you will see
   584 
  1102 later, but in terms of speed we still need to work a bit in order to
   585 \begin{eqnarray}
  1103 get sometimes(!) a faster DFA. Let's do this next.
   586 q_0 & = & \ONE + q_0\,b + q_0\,a\,b +  q_2\,b\\
  1104 
   587 q_2 & = & q_0\,a\,a + q_2\,a
  1105 \subsection*{DFA Minimisation}
   588 \end{eqnarray}
  1106 
   589 
  1107 As seen in \eqref{subsetdfa}, the subset construction from NFA to a
   590 \noindent where in Equation (4) we have two occurences
  1108 DFA can result in a rather ``inefficient'' DFA. Meaning there are
   591 of $q_0$. Like the laws about $+$ and $\cdot$, we can simplify 
  1109 states that are not needed. There are two kinds of such unneeded
   592 Equation (4) to obtain the following two equations:
  1110 states: \emph{unreachable} states and \emph{non-distinguishable}
   593 
  1111 states. The first kind of states can just be removed without affecting
   594 \begin{eqnarray}
  1112 the language that can be recognised (after all they are
   595 q_0 & = & \ONE + q_0\,(b + a\,b) +  q_2\,b\\
  1113 unreachable). The second kind can also be recognised and thus a DFA
   596 q_2 & = & q_0\,a\,a + q_2\,a
  1114 can be \emph{minimised} by the following algorithm:
   597 \end{eqnarray}
       
   598  
       
   599 \noindent Unfortunately we cannot make any more progress with
       
   600 substituting equations, because both (6) and (7) contain the
       
   601 variable on the left-hand side also on the right-hand side.
       
   602 Here we need to now use a law that is different from the usual
       
   603 laws about linear equations. It is called \emph{Arden's rule}.
       
   604 It states that if an equation is of the form $q = q\,r + s$
       
   605 then it can be transformed to $q = s\, r^*$. Since we can
       
   606 assume $+$ is symmetric, Equation (7) is of that form: $s$ is
       
   607 $q_0\,a\,a$ and $r$ is $a$. That means we can transform
       
   608 (7) to obtain the two new equations
       
   609 
       
   610 \begin{eqnarray}
       
   611 q_0 & = & \ONE + q_0\,(b + a\,b) +  q_2\,b\\
       
   612 q_2 & = & q_0\,a\,a\,(a^*)
       
   613 \end{eqnarray}
       
   614 
       
   615 \noindent Now again we can substitute the second equation into
       
   616 the first in order to eliminate the variable $q_2$.
       
   617 
       
   618 \begin{eqnarray}
       
   619 q_0 & = & \ONE + q_0\,(b + a\,b) +  q_0\,a\,a\,(a^*)\,b
       
   620 \end{eqnarray}
       
   621 
       
   622 \noindent Pulling $q_0$ out as a single factor gives:
       
   623 
       
   624 \begin{eqnarray}
       
   625 q_0 & = & \ONE + q_0\,(b + a\,b + a\,a\,(a^*)\,b)
       
   626 \end{eqnarray}
       
   627 
       
   628 \noindent This equation is again of the form so that we can
       
   629 apply Arden's rule ($r$ is $b + a\,b + a\,a\,(a^*)\,b$ and $s$
       
   630 is $\ONE$). This gives as solution for $q_0$ the following
       
   631 regular expression:
       
   632 
       
   633 \begin{eqnarray}
       
   634 q_0 & = & \ONE\,(b + a\,b + a\,a\,(a^*)\,b)^*
       
   635 \end{eqnarray}
       
   636 
       
   637 \noindent Since this is a regular expression, we can simplify
       
   638 away the $\ONE$ to obtain the slightly simpler regular
       
   639 expression
       
   640 
       
   641 \begin{eqnarray}
       
   642 q_0 & = & (b + a\,b + a\,a\,(a^*)\,b)^*
       
   643 \end{eqnarray}
       
   644 
       
   645 \noindent 
       
   646 Now we can unwind this process and obtain the solutions
       
   647 for the other equations. This gives:
       
   648 
       
   649 \begin{eqnarray}
       
   650 q_0 & = & (b + a\,b + a\,a\,(a^*)\,b)^*\\
       
   651 q_1 & = & (b + a\,b + a\,a\,(a^*)\,b)^*\,a\\
       
   652 q_2 & = & (b + a\,b + a\,a\,(a^*)\,b)^*\,a\,a\,(a)^*
       
   653 \end{eqnarray}
       
   654 
       
   655 \noindent Finally, we only need to ``add'' up the equations
       
   656 which correspond to a terminal state. In our running example,
       
   657 this is just $q_2$. Consequently, a regular expression
       
   658 that recognises the same language as the automaton is
       
   659 
       
   660 \[
       
   661 (b + a\,b + a\,a\,(a^*)\,b)^*\,a\,a\,(a)^*
       
   662 \]
       
   663 
       
   664 \noindent You can somewhat crosscheck your solution
       
   665 by taking a string the regular expression can match and
       
   666 and see whether it can be matched by the automaton.
       
   667 One string for example is $aaa$ and \emph{voila} this 
       
   668 string is also matched by the automaton.
       
   669 
       
   670 We should prove that Brzozowski's method really produces
       
   671 an equivalent  regular expression for the automaton. But
       
   672 for the purposes of this module, we omit this.
       
   673 
       
   674 \subsubsection*{Automata Minimization}
       
   675 
       
   676 As seen in the subset construction, the translation 
       
   677 of an NFA to a DFA can result in a rather ``inefficient'' 
       
   678 DFA. Meaning there are states that are not needed. A
       
   679 DFA can be \emph{minimised} by the following algorithm:
       
   680 
  1115 
   681 \begin{enumerate}
  1116 \begin{enumerate}
   682 \item Take all pairs $(q, p)$ with $q \not= p$
  1117 \item Take all pairs $(q, p)$ with $q \not= p$
   683 \item Mark all pairs that accepting and non-accepting states
  1118 \item Mark all pairs that accepting and non-accepting states
   684 \item For all unmarked pairs $(q, p)$ and all characters $c$
  1119 \item For all unmarked pairs $(q, p)$ and all characters $c$
   691       are marked. If there is one, then also mark $(q, p)$.
  1126       are marked. If there is one, then also mark $(q, p)$.
   692 \item Repeat last step until no change.
  1127 \item Repeat last step until no change.
   693 \item All unmarked pairs can be merged.
  1128 \item All unmarked pairs can be merged.
   694 \end{enumerate}
  1129 \end{enumerate}
   695 
  1130 
   696 \noindent To illustrate this algorithm, consider the following 
  1131 \noindent Unfortunately, once we throw away all unreachable states in
   697 DFA.
  1132 \eqref{subsetdfa}, all remaining states are needed.  In order to
       
  1133 illustrate the minimisation algorithm, consider the following DFA.
   698 
  1134 
   699 \begin{center}
  1135 \begin{center}
   700 \begin{tikzpicture}[>=stealth',very thick,auto,
  1136 \begin{tikzpicture}[>=stealth',very thick,auto,
   701                     every state/.style={minimum size=0pt,
  1137                     every state/.style={minimum size=0pt,
   702                     inner sep=2pt,draw=blue!50,very thick,
  1138                     inner sep=2pt,draw=blue!50,very thick,
   703                     fill=blue!20}]
  1139                     fill=blue!20}]
   704 \node[state,initial]  (q_0)  {$q_0$};
  1140 \node[state,initial]  (Q_0)  {$Q_0$};
   705 \node[state] (q_1) [right=of q_0] {$q_1$};
  1141 \node[state] (Q_1) [right=of Q_0] {$Q_1$};
   706 \node[state] (q_2) [below right=of q_0] {$q_2$};
  1142 \node[state] (Q_2) [below right=of Q_0] {$Q_2$};
   707 \node[state] (q_3) [right=of q_2] {$q_3$};
  1143 \node[state] (Q_3) [right=of Q_2] {$Q_3$};
   708 \node[state, accepting] (q_4) [right=of q_1] {$q_4$};
  1144 \node[state, accepting] (Q_4) [right=of Q_1] {$Q_4$};
   709 \path[->] (q_0) edge node [above]  {$a$} (q_1);
  1145 \path[->] (Q_0) edge node [above]  {$a$} (Q_1);
   710 \path[->] (q_1) edge node [above]  {$a$} (q_4);
  1146 \path[->] (Q_1) edge node [above]  {$a$} (Q_4);
   711 \path[->] (q_4) edge [loop right] node  {$a, b$} ();
  1147 \path[->] (Q_4) edge [loop right] node  {$a, b$} ();
   712 \path[->] (q_3) edge node [right]  {$a$} (q_4);
  1148 \path[->] (Q_3) edge node [right]  {$a$} (Q_4);
   713 \path[->] (q_2) edge node [above]  {$a$} (q_3);
  1149 \path[->] (Q_2) edge node [above]  {$a$} (Q_3);
   714 \path[->] (q_1) edge node [right]  {$b$} (q_2);
  1150 \path[->] (Q_1) edge node [right]  {$b$} (Q_2);
   715 \path[->] (q_0) edge node [above]  {$b$} (q_2);
  1151 \path[->] (Q_0) edge node [above]  {$b$} (Q_2);
   716 \path[->] (q_2) edge [loop left] node  {$b$} ();
  1152 \path[->] (Q_2) edge [loop left] node  {$b$} ();
   717 \path[->] (q_3) edge [bend left=95, looseness=1.3] node 
  1153 \path[->] (Q_3) edge [bend left=95, looseness=1.3] node 
   718   [below]  {$b$} (q_0);
  1154   [below]  {$b$} (Q_0);
   719 \end{tikzpicture}
  1155 \end{tikzpicture}
   720 \end{center}
  1156 \end{center}
   721 
  1157 
   722 \noindent In Step 1 and 2 we consider essentially a triangle
  1158 \noindent In Step 1 and 2 we consider essentially a triangle
   723 of the form
  1159 of the form
   734 \draw (1,0) -- (1, 4);
  1170 \draw (1,0) -- (1, 4);
   735 \draw (2,0) -- (2, 3);
  1171 \draw (2,0) -- (2, 3);
   736 \draw (3,0) -- (3, 2);
  1172 \draw (3,0) -- (3, 2);
   737 \draw (4,0) -- (4, 1);
  1173 \draw (4,0) -- (4, 1);
   738 
  1174 
   739 \draw (0.5,-0.5) node {$q_0$}; 
  1175 \draw (0.5,-0.5) node {$Q_0$}; 
   740 \draw (1.5,-0.5) node {$q_1$}; 
  1176 \draw (1.5,-0.5) node {$Q_1$}; 
   741 \draw (2.5,-0.5) node {$q_2$}; 
  1177 \draw (2.5,-0.5) node {$Q_2$}; 
   742 \draw (3.5,-0.5) node {$q_3$};
  1178 \draw (3.5,-0.5) node {$Q_3$};
   743  
  1179  
   744 \draw (-0.5, 3.5) node {$q_1$}; 
  1180 \draw (-0.5, 3.5) node {$Q_1$}; 
   745 \draw (-0.5, 2.5) node {$q_2$}; 
  1181 \draw (-0.5, 2.5) node {$Q_2$}; 
   746 \draw (-0.5, 1.5) node {$q_3$}; 
  1182 \draw (-0.5, 1.5) node {$Q_3$}; 
   747 \draw (-0.5, 0.5) node {$q_4$}; 
  1183 \draw (-0.5, 0.5) node {$Q_4$}; 
   748 
  1184 
   749 \draw (0.5,0.5) node {\large$\star$}; 
  1185 \draw (0.5,0.5) node {\large$\star$}; 
   750 \draw (1.5,0.5) node {\large$\star$}; 
  1186 \draw (1.5,0.5) node {\large$\star$}; 
   751 \draw (2.5,0.5) node {\large$\star$}; 
  1187 \draw (2.5,0.5) node {\large$\star$}; 
   752 \draw (3.5,0.5) node {\large$\star$};
  1188 \draw (3.5,0.5) node {\large$\star$};
   753 \end{tikzpicture}
  1189 \end{tikzpicture}
   754 \end{center}
  1190 \end{center}
   755 
  1191 
   756 \noindent where the lower row is filled with stars, because in
  1192 \noindent where the lower row is filled with stars, because in
   757 the corresponding pairs there is always one state that is
  1193 the corresponding pairs there is always one state that is
   758 accepting ($q_4$) and a state that is non-accepting (the other
  1194 accepting ($Q_4$) and a state that is non-accepting (the other
   759 states).
  1195 states).
   760 
  1196 
   761 Now in Step 3 we need to fill in more stars according whether 
  1197 In Step 3 we need to fill in more stars according whether 
   762 one of the next-state pairs are marked. We have to do this 
  1198 one of the next-state pairs are marked. We have to do this 
   763 for every unmarked field until there is no change anymore.
  1199 for every unmarked field until there is no change anymore.
   764 This gives the triangle
  1200 This gives the triangle
   765 
  1201 
   766 \begin{center}
  1202 \begin{center}
   775 \draw (1,0) -- (1, 4);
  1211 \draw (1,0) -- (1, 4);
   776 \draw (2,0) -- (2, 3);
  1212 \draw (2,0) -- (2, 3);
   777 \draw (3,0) -- (3, 2);
  1213 \draw (3,0) -- (3, 2);
   778 \draw (4,0) -- (4, 1);
  1214 \draw (4,0) -- (4, 1);
   779 
  1215 
   780 \draw (0.5,-0.5) node {$q_0$}; 
  1216 \draw (0.5,-0.5) node {$Q_0$}; 
   781 \draw (1.5,-0.5) node {$q_1$}; 
  1217 \draw (1.5,-0.5) node {$Q_1$}; 
   782 \draw (2.5,-0.5) node {$q_2$}; 
  1218 \draw (2.5,-0.5) node {$Q_2$}; 
   783 \draw (3.5,-0.5) node {$q_3$};
  1219 \draw (3.5,-0.5) node {$Q_3$};
   784  
  1220  
   785 \draw (-0.5, 3.5) node {$q_1$}; 
  1221 \draw (-0.5, 3.5) node {$Q_1$}; 
   786 \draw (-0.5, 2.5) node {$q_2$}; 
  1222 \draw (-0.5, 2.5) node {$Q_2$}; 
   787 \draw (-0.5, 1.5) node {$q_3$}; 
  1223 \draw (-0.5, 1.5) node {$Q_3$}; 
   788 \draw (-0.5, 0.5) node {$q_4$}; 
  1224 \draw (-0.5, 0.5) node {$Q_4$}; 
   789 
  1225 
   790 \draw (0.5,0.5) node {\large$\star$}; 
  1226 \draw (0.5,0.5) node {\large$\star$}; 
   791 \draw (1.5,0.5) node {\large$\star$}; 
  1227 \draw (1.5,0.5) node {\large$\star$}; 
   792 \draw (2.5,0.5) node {\large$\star$}; 
  1228 \draw (2.5,0.5) node {\large$\star$}; 
   793 \draw (3.5,0.5) node {\large$\star$};
  1229 \draw (3.5,0.5) node {\large$\star$};
   796 \draw (0.5,3.5) node {\large$\star$}; 
  1232 \draw (0.5,3.5) node {\large$\star$}; 
   797 \draw (1.5,2.5) node {\large$\star$}; 
  1233 \draw (1.5,2.5) node {\large$\star$}; 
   798 \end{tikzpicture}
  1234 \end{tikzpicture}
   799 \end{center}
  1235 \end{center}
   800 
  1236 
   801 \noindent which means states $q_0$ and $q_2$, as well as $q_1$
  1237 \noindent which means states $Q_0$ and $Q_2$, as well as $Q_1$
   802 and $q_3$ can be merged. This gives the following minimal DFA
  1238 and $Q_3$ can be merged. This gives the following minimal DFA
   803 
  1239 
   804 \begin{center}
  1240 \begin{center}
   805 \begin{tikzpicture}[>=stealth',very thick,auto,
  1241 \begin{tikzpicture}[>=stealth',very thick,auto,
   806                     every state/.style={minimum size=0pt,
  1242                     every state/.style={minimum size=0pt,
   807                     inner sep=2pt,draw=blue!50,very thick,
  1243                     inner sep=2pt,draw=blue!50,very thick,
   808                     fill=blue!20}]
  1244                     fill=blue!20}]
   809 \node[state,initial]  (q_02)  {$q_{0, 2}$};
  1245 \node[state,initial]  (Q_02)  {$Q_{0, 2}$};
   810 \node[state] (q_13) [right=of q_02] {$q_{1, 3}$};
  1246 \node[state] (Q_13) [right=of Q_02] {$Q_{1, 3}$};
   811 \node[state, accepting] (q_4) [right=of q_13] 
  1247 \node[state, accepting] (Q_4) [right=of Q_13] 
   812   {$q_{4\phantom{,0}}$};
  1248   {$Q_{4\phantom{,0}}$};
   813 \path[->] (q_02) edge [bend left] node [above]  {$a$} (q_13);
  1249 \path[->] (Q_02) edge [bend left] node [above]  {$a$} (Q_13);
   814 \path[->] (q_13) edge [bend left] node [below]  {$b$} (q_02);
  1250 \path[->] (Q_13) edge [bend left] node [below]  {$b$} (Q_02);
   815 \path[->] (q_02) edge [loop below] node  {$b$} ();
  1251 \path[->] (Q_02) edge [loop below] node  {$b$} ();
   816 \path[->] (q_13) edge node [above]  {$a$} (q_4);
  1252 \path[->] (Q_13) edge node [above]  {$a$} (Q_4);
   817 \path[->] (q_4) edge [loop above] node  {$a, b$} ();
  1253 \path[->] (Q_4) edge [loop above] node  {$a, b$} ();
   818 \end{tikzpicture}
  1254 \end{tikzpicture}
   819 \end{center}
  1255 \end{center}
   820 
  1256 
   821 \subsubsection*{Regular Languages}
  1257 
       
  1258 By the way, we are not bothering with implementing the above
       
  1259 minimisation algorith: while up to now all the transformations used
       
  1260 some clever composition of functions, the minimisation algorithm
       
  1261 cannot be implemented by just composing some functions. For this we
       
  1262 would require a more concrete representation of the transition
       
  1263 function (like maps). If we did this, however, then many advantages of
       
  1264 the functions would be thrown away. So the compromise is to not being
       
  1265 able to minimise (easily) our DFAs.
       
  1266 
       
  1267 \subsection*{Brzozowski's Method}
       
  1268 
       
  1269 I know this handout is already a long, long rant: but after all it is
       
  1270 a topic that has been researched for more than 60 years. If you
       
  1271 reflect on what you have read so far, the story is that you can take a
       
  1272 regular expression, translate it via the Thompson Construction into an
       
  1273 $\epsilon$NFA, then translate it into a NFA by removing all
       
  1274 $\epsilon$-transitions, and then via the subset construction obtain a
       
  1275 DFA. In all steps we made sure the language, or which strings can be
       
  1276 recognised, stays the same. Of couse we should have proved this in
       
  1277 each step, but let us cut corners here.  After the last section, we
       
  1278 can even minimise the DFA (maybe not in code). But again we made sure
       
  1279 the same language is recognised. You might be wondering: Can we go
       
  1280 into the other direction?  Can we go from a DFA and obtain a regular
       
  1281 expression that can recognise the same language as the DFA?\medskip
       
  1282 
       
  1283 \noindent
       
  1284 The answer is yes. Again there are several methods for calculating a
       
  1285 regular expression for a DFA. I will show you Brzozowski's method
       
  1286 because it calculates a regular expression using quite familiar
       
  1287 transformations for solving equational systems. Consider the DFA:
       
  1288 
       
  1289 \begin{center}
       
  1290 \begin{tikzpicture}[scale=1.5,>=stealth',very thick,auto,
       
  1291                     every state/.style={minimum size=0pt,
       
  1292                     inner sep=2pt,draw=blue!50,very thick,
       
  1293                     fill=blue!20}]
       
  1294   \node[state, initial]        (q0) at ( 0,1) {$Q_0$};
       
  1295   \node[state]                    (q1) at ( 1,1) {$Q_1$};
       
  1296   \node[state, accepting] (q2) at ( 2,1) {$Q_2$};
       
  1297   \path[->] (q0) edge[bend left] node[above] {$a$} (q1)
       
  1298             (q1) edge[bend left] node[above] {$b$} (q0)
       
  1299             (q2) edge[bend left=50] node[below] {$b$} (q0)
       
  1300             (q1) edge node[above] {$a$} (q2)
       
  1301             (q2) edge [loop right] node {$a$} ()
       
  1302             (q0) edge [loop below] node {$b$} ();
       
  1303 \end{tikzpicture}
       
  1304 \end{center}
       
  1305 
       
  1306 \noindent for which we can set up the following equational
       
  1307 system
       
  1308 
       
  1309 \begin{eqnarray}
       
  1310 Q_0 & = & \ONE + Q_0\,b + Q_1\,b +  Q_2\,b\\
       
  1311 Q_1 & = & Q_0\,a\\
       
  1312 Q_2 & = & Q_1\,a + Q_2\,a
       
  1313 \end{eqnarray}
       
  1314 
       
  1315 \noindent There is an equation for each node in the DFA. Let
       
  1316 us have a look how the right-hand sides of the equations are
       
  1317 constructed. First have a look at the second equation: the
       
  1318 left-hand side is $Q_1$ and the right-hand side $Q_0\,a$. The
       
  1319 right-hand side is essentially all possible ways how to end up
       
  1320 in node $Q_1$. There is only one incoming edge from $Q_0$ consuming
       
  1321 an $a$.  Therefore the right hand side is this
       
  1322 state followed by character---in this case $Q_0\,a$. Now lets
       
  1323 have a look at the third equation: there are two incoming
       
  1324 edges for $Q_2$. Therefore we have two terms, namely $Q_1\,a$ and
       
  1325 $Q_2\,a$. These terms are separated by $+$. The first states
       
  1326 that if in state $Q_1$ consuming an $a$ will bring you to
       
  1327 $Q_2$, and the second that being in $Q_2$ and consuming an $a$
       
  1328 will make you stay in $Q_2$. The right-hand side of the
       
  1329 first equation is constructed similarly: there are three 
       
  1330 incoming edges, therefore there are three terms. There is
       
  1331 one exception in that we also ``add'' $\ONE$ to the
       
  1332 first equation, because it corresponds to the starting state
       
  1333 in the DFA.
       
  1334 
       
  1335 Having constructed the equational system, the question is
       
  1336 how to solve it? Remarkably the rules are very similar to
       
  1337 solving usual linear equational systems. For example the
       
  1338 second equation does not contain the variable $Q_1$ on the
       
  1339 right-hand side of the equation. We can therefore eliminate 
       
  1340 $Q_1$ from the system by just substituting this equation
       
  1341 into the other two. This gives
       
  1342 
       
  1343 \begin{eqnarray}
       
  1344 Q_0 & = & \ONE + Q_0\,b + Q_0\,a\,b +  Q_2\,b\\
       
  1345 Q_2 & = & Q_0\,a\,a + Q_2\,a
       
  1346 \end{eqnarray}
       
  1347 
       
  1348 \noindent where in Equation (4) we have two occurrences
       
  1349 of $Q_0$. Like the laws about $+$ and $\cdot$, we can simplify 
       
  1350 Equation (4) to obtain the following two equations:
       
  1351 
       
  1352 \begin{eqnarray}
       
  1353 Q_0 & = & \ONE + Q_0\,(b + a\,b) +  Q_2\,b\\
       
  1354 Q_2 & = & Q_0\,a\,a + Q_2\,a
       
  1355 \end{eqnarray}
       
  1356  
       
  1357 \noindent Unfortunately we cannot make any more progress with
       
  1358 substituting equations, because both (6) and (7) contain the
       
  1359 variable on the left-hand side also on the right-hand side.
       
  1360 Here we need to now use a law that is different from the usual
       
  1361 laws about linear equations. It is called \emph{Arden's rule}.
       
  1362 It states that if an equation is of the form $q = q\,r + s$
       
  1363 then it can be transformed to $q = s\, r^*$. Since we can
       
  1364 assume $+$ is symmetric, Equation (7) is of that form: $s$ is
       
  1365 $Q_0\,a\,a$ and $r$ is $a$. That means we can transform
       
  1366 (7) to obtain the two new equations
       
  1367 
       
  1368 \begin{eqnarray}
       
  1369 Q_0 & = & \ONE + Q_0\,(b + a\,b) +  Q_2\,b\\
       
  1370 Q_2 & = & Q_0\,a\,a\,(a^*)
       
  1371 \end{eqnarray}
       
  1372 
       
  1373 \noindent Now again we can substitute the second equation into
       
  1374 the first in order to eliminate the variable $Q_2$.
       
  1375 
       
  1376 \begin{eqnarray}
       
  1377 Q_0 & = & \ONE + Q_0\,(b + a\,b) +  Q_0\,a\,a\,(a^*)\,b
       
  1378 \end{eqnarray}
       
  1379 
       
  1380 \noindent Pulling $Q_0$ out as a single factor gives:
       
  1381 
       
  1382 \begin{eqnarray}
       
  1383 Q_0 & = & \ONE + Q_0\,(b + a\,b + a\,a\,(a^*)\,b)
       
  1384 \end{eqnarray}
       
  1385 
       
  1386 \noindent This equation is again of the form so that we can
       
  1387 apply Arden's rule ($r$ is $b + a\,b + a\,a\,(a^*)\,b$ and $s$
       
  1388 is $\ONE$). This gives as solution for $Q_0$ the following
       
  1389 regular expression:
       
  1390 
       
  1391 \begin{eqnarray}
       
  1392 Q_0 & = & \ONE\,(b + a\,b + a\,a\,(a^*)\,b)^*
       
  1393 \end{eqnarray}
       
  1394 
       
  1395 \noindent Since this is a regular expression, we can simplify
       
  1396 away the $\ONE$ to obtain the slightly simpler regular
       
  1397 expression
       
  1398 
       
  1399 \begin{eqnarray}
       
  1400 Q_0 & = & (b + a\,b + a\,a\,(a^*)\,b)^*
       
  1401 \end{eqnarray}
       
  1402 
       
  1403 \noindent 
       
  1404 Now we can unwind this process and obtain the solutions
       
  1405 for the other equations. This gives:
       
  1406 
       
  1407 \begin{eqnarray}
       
  1408 Q_0 & = & (b + a\,b + a\,a\,(a^*)\,b)^*\\
       
  1409 Q_1 & = & (b + a\,b + a\,a\,(a^*)\,b)^*\,a\\
       
  1410 Q_2 & = & (b + a\,b + a\,a\,(a^*)\,b)^*\,a\,a\,(a)^*
       
  1411 \end{eqnarray}
       
  1412 
       
  1413 \noindent Finally, we only need to ``add'' up the equations
       
  1414 which correspond to a terminal state. In our running example,
       
  1415 this is just $Q_2$. Consequently, a regular expression
       
  1416 that recognises the same language as the DFA is
       
  1417 
       
  1418 \[
       
  1419 (b + a\,b + a\,a\,(a^*)\,b)^*\,a\,a\,(a)^*
       
  1420 \]
       
  1421 
       
  1422 \noindent You can somewhat crosscheck your solution by taking a string
       
  1423 the regular expression can match and and see whether it can be matched
       
  1424 by the DFA.  One string for example is $aaa$ and \emph{voila} this
       
  1425 string is also matched by the automaton.
       
  1426 
       
  1427 We should prove that Brzozowski's method really produces an equivalent
       
  1428 regular expression. But for the purposes of this module, we omit
       
  1429 this. I guess you are relieved.
       
  1430 
       
  1431 
       
  1432 \subsection*{Regular Languages}
   822 
  1433 
   823 Given the constructions in the previous sections we obtain 
  1434 Given the constructions in the previous sections we obtain 
   824 the following overall picture:
  1435 the following overall picture:
   825 
  1436 
   826 \begin{center}
  1437 \begin{center}
   842 Although we did not prove this fact. Similarly by going from
  1453 Although we did not prove this fact. Similarly by going from
   843 DFAs to regular expressions, we can make sure for every DFA 
  1454 DFAs to regular expressions, we can make sure for every DFA 
   844 there exists a regular expression that can recognise the same
  1455 there exists a regular expression that can recognise the same
   845 language. Again we did not prove this fact. 
  1456 language. Again we did not prove this fact. 
   846 
  1457 
   847 The interesting conclusion is that automata and regular 
  1458 The fundamental conclusion we can draw is that automata and regular
   848 expressions can recognise the same set of languages:
  1459 expressions can recognise the same set of languages:
   849 
  1460 
   850 \begin{quote} A language is \emph{regular} iff there exists a
  1461 \begin{quote} A language is \emph{regular} iff there exists a
   851 regular expression that recognises all its strings.
  1462 regular expression that recognises all its strings.
   852 \end{quote}
  1463 \end{quote}
   855 
  1466 
   856 \begin{quote} A language is \emph{regular} iff there exists an
  1467 \begin{quote} A language is \emph{regular} iff there exists an
   857 automaton that recognises all its strings.
  1468 automaton that recognises all its strings.
   858 \end{quote}
  1469 \end{quote}
   859 
  1470 
   860 \noindent So for deciding whether a string is recognised by a
  1471 \noindent Note that this is not a stement for a particular language
   861 regular expression, we could use our algorithm based on
  1472 (that is a particular set of strings), but about a large class of
   862 derivatives or NFAs or DFAs. But let us quickly look at what
  1473 languages, namely the regular ones.
   863 the differences mean in computational terms. Translating a
  1474 
   864 regular expression into a NFA gives us an automaton that has
  1475 As a consequence for deciding whether a string is recognised by a
   865 $O(n)$ nodes---that means the size of the NFA grows linearly
  1476 regular expression, we could use our algorithm based on derivatives or
   866 with the size of the regular expression. The problem with NFAs
  1477 NFAs or DFAs. But let us quickly look at what the differences mean in
   867 is that the problem of deciding whether a string is accepted
  1478 computational terms. Translating a regular expression into a NFA gives
   868 or not is computationally not cheap. Remember with NFAs we
  1479 us an automaton that has $O(n)$ states---that means the size of the
   869 have potentially many next states even for the same input and
  1480 NFA grows linearly with the size of the regular expression. The
   870 also have the silent $\epsilon$-transitions. If we want to
  1481 problem with NFAs is that the problem of deciding whether a string is
   871 find a path from the starting state of an NFA to an accepting
  1482 accepted or not is computationally not cheap. Remember with NFAs we
   872 state, we need to consider all possibilities. In Ruby and
  1483 have potentially many next states even for the same input and also
   873 Python this is done by a depth-first search, which in turn
  1484 have the silent $\epsilon$-transitions. If we want to find a path from
   874 means that if a ``wrong'' choice is made, the algorithm has to
  1485 the starting state of a NFA to an accepting state, we need to consider
   875 backtrack and thus explore all potential candidates. This is
  1486 all possibilities. In Ruby, Python and Java this is done by a
   876 exactly the reason why Ruby and Python are so slow for evil
  1487 depth-first search, which in turn means that if a ``wrong'' choice is
   877 regular expressions. An alternative to the potentially slow
  1488 made, the algorithm has to backtrack and thus explore all potential
   878 depth-first search is to explore the search space in a
  1489 candidates. This is exactly the reason why Ruby, Python and Java are
   879 breadth-first fashion, but this might incur a big memory
  1490 so slow for evil regular expressions. An alternative to the
   880 penalty.  
  1491 potentially slow depth-first search is to explore the search space in
   881 
  1492 a breadth-first fashion, but this might incur a big memory penalty.
   882 To avoid the problems with NFAs, we can translate them 
  1493 
   883 into DFAs. With DFAs the problem of deciding whether a
  1494 To avoid the problems with NFAs, we can translate them into DFAs. With
   884 string is recognised or not is much simpler, because in
  1495 DFAs the problem of deciding whether a string is recognised or not is
   885 each state it is completely determined what the next
  1496 much simpler, because in each state it is completely determined what
   886 state will be for a given input. So no search is needed.
  1497 the next state will be for a given input. So no search is needed.  The
   887 The problem with this is that the translation to DFAs
  1498 problem with this is that the translation to DFAs can explode
   888 can explode exponentially the number of states. Therefore when 
  1499 exponentially the number of states. Therefore when this route is
   889 this route is taken, we definitely need to minimise the
  1500 taken, we definitely need to minimise the resulting DFAs in order to
   890 resulting DFAs in order to have an acceptable memory 
  1501 have an acceptable memory and runtime behaviour. But remember the
   891 and runtime behaviour. But remember the subset construction
  1502 subset construction in the worst case explodes the number of states by
   892 in the worst case explodes the number of states by $2^n$.
  1503 $2^n$.  Effectively also the translation to DFAs can incur a big
   893 Effectively also the translation to DFAs can incur a big
       
   894 runtime penalty.
  1504 runtime penalty.
   895 
  1505 
   896 But this does not mean that everything is bad with automata.
  1506 But this does not mean that everything is bad with automata.  Recall
   897 Recall the problem of finding a regular expressions for the
  1507 the problem of finding a regular expressions for the language that is
   898 language that is \emph{not} recognised by a regular
  1508 \emph{not} recognised by a regular expression. In our implementation
   899 expression. In our implementation we added explicitly such a
  1509 we added explicitly such a regular expressions because they are useful
   900 regular expressions because they are useful for recognising
  1510 for recognising comments. But in principle we did not need to. The
   901 comments. But in principle we did not need to. The argument
  1511 argument for this is as follows: take a regular expression, translate
   902 for this is as follows: take a regular expression, translate
       
   903 it into a NFA and then a DFA that both recognise the same
  1512 it into a NFA and then a DFA that both recognise the same
   904 language. Once you have the DFA it is very easy to construct
  1513 language. Once you have the DFA it is very easy to construct the
   905 the automaton for the language not recognised by an DFA. If
  1514 automaton for the language not recognised by a DFA. If the DFA is
   906 the DFA is completed (this is important!), then you just need
  1515 completed (this is important!), then you just need to exchange the
   907 to exchange the accepting and non-accepting states. You can
  1516 accepting and non-accepting states. You can then translate this DFA
   908 then translate this DFA back into a regular expression and
  1517 back into a regular expression and that will be the regular expression
   909 that will be the regular expression that can match all strings
  1518 that can match all strings the original regular expression could
   910 the original regular expression could \emph{not} match.
  1519 \emph{not} match.
   911 
  1520 
   912 It is also interesting that not all languages are regular. The
  1521 It is also interesting that not all languages are regular. The most
   913 most well-known example of a language that is not regular
  1522 well-known example of a language that is not regular consists of all
   914 consists of all the strings of the form
  1523 the strings of the form
   915 
  1524 
   916 \[a^n\,b^n\]
  1525 \[a^n\,b^n\]
   917 
  1526 
   918 \noindent meaning strings that have the same number of $a$s
  1527 \noindent meaning strings that have the same number of $a$s and
   919 and $b$s. You can try, but you cannot find a regular
  1528 $b$s. You can try, but you cannot find a regular expression for this
   920 expression for this language and also not an automaton. One
  1529 language and also not an automaton. One can actually prove that there
   921 can actually prove that there is no regular expression nor
  1530 is no regular expression nor automaton for this language, but again
   922 automaton for this language, but again that would lead us too
  1531 that would lead us too far afield for what we want to do in this
   923 far afield for what we want to do in this module.
  1532 module.
   924 
  1533 
   925 \section*{Further Reading}
  1534 
   926 
  1535 \subsection*{Where Have Derivatives Gone?}
   927 Compare what a ``human expert'' would create as an automaton for the
  1536 
   928 regular expression $a (b + c)^*$ and what the Thomson
  1537 By now you are probably fed up with this text. It is now way too long
   929 algorithm generates.
  1538 for one lecture, but there is still one aspect of the
       
  1539 automata-regular-expression-connection I like to describe. Perhaps by
       
  1540 now you are asking yourself: Where have the derivatives gone? Did we
       
  1541 just forget them?  Well, they have a place in the picture of
       
  1542 calculating a DFA from the regular expression.
       
  1543 
       
  1544 To be done
       
  1545 
       
  1546 \begin{center}
       
  1547 \begin{tikzpicture}
       
  1548   [level distance=25mm,very thick,auto,
       
  1549    level 1/.style={sibling distance=30mm},
       
  1550    level 2/.style={sibling distance=15mm},
       
  1551    every node/.style={minimum size=30pt,
       
  1552                     inner sep=0pt,circle,draw=blue!50,very thick,
       
  1553                     fill=blue!20}]
       
  1554   \node {$r$} [grow=right]
       
  1555   child[->] {node (cn) {$d_{c_n}$}
       
  1556     child { node {$dd_{c_nc_n}$}}
       
  1557     child { node {$dd_{c_nc_1}$}}
       
  1558     %edge from parent node[left] {$c_n$}
       
  1559   }
       
  1560   child[->] {node (c1) {$d_{c_1}$}
       
  1561     child { node {$dd_{c_1c_n}$}}
       
  1562     child { node {$dd_{c_1c_1}$}}
       
  1563     %edge from parent node[left] {$c_1$}
       
  1564   };
       
  1565   %%\draw (cn) -- (c1) node {\vdots}; 
       
  1566 \end{tikzpicture}  
       
  1567 \end{center}
       
  1568 
       
  1569 %\section*{Further Reading}
       
  1570 
       
  1571 %Compare what a ``human expert'' would create as an automaton for the
       
  1572 %regular expression $a\cdot (b + c)^*$ and what the Thomson
       
  1573 %algorithm generates.
   930 
  1574 
   931 %http://www.inf.ed.ac.uk/teaching/courses/ct/
  1575 %http://www.inf.ed.ac.uk/teaching/courses/ct/
   932 \end{document}
  1576 \end{document}
   933 
  1577 
   934 %%% Local Variables: 
  1578 %%% Local Variables: 
   935 %%% mode: latex
  1579 %%% mode: latex
   936 %%% TeX-master: t
  1580 %%% TeX-master: t
   937 %%% End: 
  1581 %%% End: 
       
  1582 
       
  1583