handouts/ho03.tex
author Christian Urban <urbanc@in.tum.de>
Mon, 08 Oct 2018 11:35:04 +0100 (2018-10-08)
changeset 573 711bbc480998
parent 572 4a1739f256fd
child 578 6e5e3adc9eb1
permissions -rw-r--r--
updated
\documentclass{article}
\usepackage{../style}
\usepackage{../langs}
\usepackage{../graphics}


\begin{document}
\fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017}

\section*{Handout 3 (Finite Automata)}


Every formal language and compiler course I know of bombards you first
with automata and then to a much, much smaller extend with regular
expressions. As you can see, this course is turned upside down:
regular expressions come first. The reason is that regular expressions
are easier to reason about and the notion of derivatives, although
already quite old, only became more widely known rather
recently. Still, let us in this lecture have a closer look at automata
and their relation to regular expressions. This will help us with
understanding why the regular expression matchers in Python, Ruby and
Java are so slow with certain regular expressions. On the way we will
also see what are the limitations of regular
expressions. Unfortunately, they cannot be used for \emph{everything}.


\subsection*{Deterministic Finite Automata}

Lets start\ldots the central definition is:\medskip

\noindent 
A \emph{deterministic finite automaton} (DFA), say $A$, is
given by a five-tuple written ${\cal A}(\varSigma, Qs, Q_0, F, \delta)$ where

\begin{itemize}
\item $\varSigma$ is an alphabet,  
\item $Qs$ is a finite set of states,
\item $Q_0 \in Qs$ is the start state,
\item $F \subseteq Qs$ are the accepting states, and
\item $\delta$ is the transition function.
\end{itemize}

\noindent I am sure you have seen this definition already
before. The transition function determines how to ``transition'' from
one state to the next state with respect to a character. We have the
assumption that these transition functions do not need to be defined
everywhere: so it can be the case that given a character there is no
next state, in which case we need to raise a kind of ``failure
exception''.  That means actually we have \emph{partial} functions as
transitions---see the Scala implementation for DFAs later on.  A
typical example of a DFA is

\begin{center}
\begin{tikzpicture}[>=stealth',very thick,auto,
                    every state/.style={minimum size=0pt,
                    inner sep=2pt,draw=blue!50,very thick,
                    fill=blue!20},scale=2]
\node[state,initial]  (Q_0)  {$Q_0$};
\node[state] (Q_1) [right=of Q_0] {$Q_1$};
\node[state] (Q_2) [below right=of Q_0] {$Q_2$};
\node[state] (Q_3) [right=of Q_2] {$Q_3$};
\node[state, accepting] (Q_4) [right=of Q_1] {$Q_4$};
\path[->] (Q_0) edge node [above]  {$a$} (Q_1);
\path[->] (Q_1) edge node [above]  {$a$} (Q_4);
\path[->] (Q_4) edge [loop right] node  {$a, b$} ();
\path[->] (Q_3) edge node [right]  {$a$} (Q_4);
\path[->] (Q_2) edge node [above]  {$a$} (Q_3);
\path[->] (Q_1) edge node [right]  {$b$} (Q_2);
\path[->] (Q_0) edge node [above]  {$b$} (Q_2);
\path[->] (Q_2) edge [loop left] node  {$b$} ();
\path[->] (Q_3) edge [bend left=95, looseness=1.3] node [below]  {$b$} (Q_0);
\end{tikzpicture}
\end{center}

\noindent In this graphical notation, the accepting state $Q_4$ is
indicated with double circles. Note that there can be more than one
accepting state. It is also possible that a DFA has no accepting
state at all, or that the starting state is also an accepting
state. In the case above the transition function is defined everywhere
and can also be given as a table as follows:

\[
\begin{array}{lcl}
(Q_0, a) &\rightarrow& Q_1\\
(Q_0, b) &\rightarrow& Q_2\\
(Q_1, a) &\rightarrow& Q_4\\
(Q_1, b) &\rightarrow& Q_2\\
(Q_2, a) &\rightarrow& Q_3\\
(Q_2, b) &\rightarrow& Q_2\\
(Q_3, a) &\rightarrow& Q_4\\
(Q_3, b) &\rightarrow& Q_0\\
(Q_4, a) &\rightarrow& Q_4\\
(Q_4, b) &\rightarrow& Q_4\\
\end{array}
\]

\noindent
Please check that this table represents the same transition function
as the graph above.

We need to define the notion of what language is accepted by
an automaton. For this we lift the transition function
$\delta$ from characters to strings as follows:

\[
\begin{array}{lcl}
\widehat{\delta}(q, [])        & \dn & q\\
\widehat{\delta}(q, c\!::\!s) & \dn & \widehat{\delta}(\delta(q, c), s)\\
\end{array}
\]

\noindent This lifted transition function is often called
\emph{delta-hat}. Given a string, we start in the starting state and
take the first character of the string, follow to the next state, then
take the second character and so on. Once the string is exhausted and
we end up in an accepting state, then this string is accepted by the
automaton. Otherwise it is not accepted. This also means that if along
the way we hit the case where the transition function $\delta$ is not
defined, we need to raise an error. In our implementation we will deal
with this case elegantly by using Scala's \texttt{Try}.  Summing up: a
string $s$ is in the \emph{language accepted by the automaton} ${\cal
  A}(\varSigma, Q, Q_0, F, \delta)$ iff

\[
\widehat{\delta}(Q_0, s) \in F 
\]

\noindent I let you think about a definition that describes the set of
all strings accepted by a deterministic finite automaton.

\begin{figure}[p]
\small
\lstinputlisting[numbers=left]{../progs/display/dfa.scala}
\caption{An implementation of DFAs in Scala using partial functions.
  Note some subtleties: \texttt{deltas} implements the delta-hat
  construction by lifting the (partial) transition  function to lists
  of characters. Since \texttt{delta} is given as a partial function,
  it can obviously go ``wrong'' in which case the \texttt{Try} in
  \texttt{accepts} catches the error and returns \texttt{false}---that
  means the string is not accepted.  The example \texttt{delta} in
  Line 28--38 implements the DFA example shown earlier in the
  handout.\label{dfa}}
\end{figure}

My take on an implementation of DFAs in Scala is given in
Figure~\ref{dfa}. As you can see, there are many features of the
mathematical definition that are quite closely reflected in the
code. In the DFA-class, there is a starting state, called
\texttt{start}, with the polymorphic type \texttt{A}.  There is a
partial function \texttt{delta} for specifying the transitions---these
partial functions take a state (of polymorphic type \texttt{A}) and an
input (of polymorphic type \texttt{C}) and produce a new state (of
type \texttt{A}). For the moment it is OK to assume that \texttt{A} is
some arbitrary type for states and the input is just characters.  (The
reason for not having concrete types, but polymorphic types for the
states and the input of DFAs will become clearer later on.)

The DFA-class has also an argument for specifying final states. In the
implementation it is not a set of states, as in the mathematical
definition, but a function from states to booleans (this function is
supposed to return true whenever a state is final; false
otherwise). While this boolean function is different from the sets of
states, Scala allows us to use sets for such functions (see Line 40 where
the DFA is initialised). Again it will become clear later on why I use
functions for final states, rather than sets.

The most important point in the implementation is that I use Scala's
partial functions for representing the transitions; alternatives would
have been \texttt{Maps} or even \texttt{Lists}. One of the main
advantages of using partial functions is that transitions can be quite
nicely defined by a series of \texttt{case} statements (see Lines 28
-- 38 for an example). If you need to represent an automaton with a
sink state (catch-all-state), you can use Scala's pattern matching and
write something like

{\small\begin{lstlisting}[language=Scala]
abstract class State
...
case object Sink extends State

val delta : (State, Char) :=> State = 
  { case (S0, 'a') => S1
    case (S1, 'a') => S2
    case _ => Sink
  }
\end{lstlisting}}  

\noindent I let you think what the corresponding DFA looks like in the
graph notation. Also, I suggest you to tinker with the Scala code in
order to define the DFA that does not accept any string at all. 

Finally, I let you ponder whether this is a good implementation of
DFAs or not. In doing so I hope you notice that the $\varSigma$ and
$Qs$ components (the alphabet and the set of \emph{finite} states,
respectively) are missing from the class definition. This means that
the implementation allows you to do some ``fishy'' things you are not
meant to do with DFAs. Which fishy things could that be?



\subsection*{Non-Deterministic Finite Automata}

Remember we want to find out what the relation is between regular
expressions and automata. To do this with DFAs is a bit unwieldy.
While with DFAs it is always clear that given a state and a character
what the next state is (potentially none), it will be convenient to
relax this restriction. That means we allow states to have several
potential successor states. We even allow more than one starting
state. The resulting construction is called a \emph{Non-Deterministic
  Finite Automaton} (NFA) given also as a five-tuple ${\cal
  A}(\varSigma, Qs, Q_{0s}, F, \rho)$ where

\begin{itemize}
\item $\varSigma$ is an alphabet,  
\item $Qs$ is a finite set of states
\item $Q_{0s}$ is a set of start states ($Q_{0s} \subseteq Qs$)
\item $F$ are some accepting states with $F \subseteq Qs$, and
\item $\rho$ is a transition relation.
\end{itemize}

\noindent
A typical example of a NFA is

% A NFA for (ab* + b)*a
\begin{center}
\begin{tikzpicture}[>=stealth',very thick, auto,
    every state/.style={minimum size=0pt,inner sep=3pt,
      draw=blue!50,very thick,fill=blue!20},scale=2]
\node[state,initial]  (Q_0)  {$Q_0$};
\node[state] (Q_1) [right=of Q_0] {$Q_1$};
\node[state, accepting] (Q_2) [right=of Q_1] {$Q_2$};
\path[->] (Q_0) edge [loop above] node  {$b$} ();
\path[<-] (Q_0) edge node [below]  {$b$} (Q_1);
\path[->] (Q_0) edge [bend left] node [above]  {$a$} (Q_1);
\path[->] (Q_0) edge [bend right] node [below]  {$a$} (Q_2);
\path[->] (Q_1) edge [loop above] node  {$a,b$} ();
\path[->] (Q_1) edge node  [above] {$a$} (Q_2);
\end{tikzpicture}
\end{center}

\noindent
This NFA happens to have only one starting state, but in general there
could be more than one.  Notice that in state $Q_0$ we might go to
state $Q_1$ \emph{or} to state $Q_2$ when receiving an $a$. Similarly
in state $Q_1$ and receiving an $a$, we can stay in state $Q_1$
\emph{or} go to $Q_2$. This kind of choice is not allowed with
DFAs. The downside of this choice in NFAs is that when it comes to
deciding whether a string is accepted by a NFA we potentially have to
explore all possibilities. I let you think which strings the above NFA
accepts.


There are a number of additional points you should note about
NFAs. Every DFA is a NFA, but not vice versa. The $\rho$ in NFAs is a
transition \emph{relation} (DFAs have a transition function). The
difference between a function and a relation is that a function has
always a single output, while a relation gives, roughly speaking,
several outputs. Look again at the NFA above: if you are currently in
the state $Q_1$ and you read a character $b$, then you can transition
to either $Q_0$ \emph{or} $Q_2$. Which route, or output, you take is
not determined.  This non-determinism can be represented by a
relation.

My implementation of NFAs in Scala is shown in Figure~\ref{nfa}.
Perhaps interestingly, I do not actually use relations for my NFAs,
but use transition functions that return sets of states.  DFAs have
partial transition functions that return a single state; my NFAs
return a set of states. I let you think about this representation for
NFA-transitions and how it corresponds to the relations used in the
mathematical definition of NFAs. An example of a transition function
in Scala for the NFA shown above is

{\small\begin{lstlisting}[language=Scala]
val nfa_delta : (State, Char) :=> Set[State] = 
  { case (Q0, 'a') => Set(Q1, Q2)
    case (Q0, 'b') => Set(Q0)
    case (Q1, 'a') => Set(Q1, Q2)
    case (Q1, 'b') => Set(Q0, Q1) }
\end{lstlisting}}  

Like in the mathematical definition, \texttt{starts} is in
NFAs a set of states; \texttt{fins} is again a function from states to
booleans. The \texttt{next} function calculates the set of next states
reachable from a single state \texttt{q} by a character~\texttt{c}. In
case there is no such state---the partial transition function is
undefined---the empty set is returned (see function
\texttt{applyOrElse} in Lines 9 and 10). The function \texttt{nexts}
just lifts this function to sets of states.
 
\begin{figure}[p]
\small
\lstinputlisting[numbers=left]{../progs/display/nfa.scala}
\caption{A Scala implementation of NFAs using partial functions.
  Notice that the function \texttt{accepts} implements the
  acceptance of a string in a breadth-first search fashion. This can be a costly
  way of deciding whether a string is accepted or not in applications that need to handle
  large NFAs and large inputs.\label{nfa}}
\end{figure}

Look very careful at the \texttt{accepts} and \texttt{deltas}
functions in NFAs and remember that when accepting a string by a NFA
we might have to explore all possible transitions (recall which state
to go to is not unique anymore with NFAs\ldots{}we need to explore
potentially all next states). The implementation achieves this
exploration through a \emph{breadth-first search}. This is fine for
small NFAs, but can lead to real memory problems when the NFAs are
bigger and larger strings need to be processed. As result, some
regular expression matching engines resort to a \emph{depth-first
  search} with \emph{backtracking} in unsuccessful cases. In our
implementation we can implement a depth-first version of
\texttt{accepts} using Scala's \texttt{exists}-function as follows:


{\small\begin{lstlisting}[language=Scala]
def search(q: A, s: List[C]) : Boolean = s match {
  case Nil => fins(q)
  case c::cs => next(q, c).exists(search(_, cs)) 
}

def accepts2(s: List[C]) : Boolean = 
  starts.exists(search(_, s))
\end{lstlisting}}

\noindent
This depth-first way of exploration seems to work quite efficiently in
many examples and is much less of a strain on memory. The problem is
that the backtracking can get ``catastrophic'' in some
examples---remember the catastrophic backtracking from earlier
lectures. This depth-first search with backtracking is the reason for
the abysmal performance of some regular expression matchings in Java,
Ruby and Python. I like to show you this in the next two sections.


\subsection*{Epsilon NFAs}

In order to get an idea what calculations are performed by Java \&
friends, we need a method for transforming a regular expression into
an automaton. This automaton should accept exactly those strings that
are accepted by the regular expression.  The simplest and most
well-known method for this is called \emph{Thompson Construction},
after the Turing Award winner Ken Thompson. This method is by
recursion over regular expressions and depends on the non-determinism
in NFAs described in the previous section. You will see shortly why
this construction works well with NFAs, but is not so straightforward
with DFAs.

Unfortunately we are still one step away from our intended target
though---because this construction uses a version of NFAs that allows
``silent transitions''. The idea behind silent transitions is that
they allow us to go from one state to the next without having to
consume a character. We label such silent transition with the letter
$\epsilon$ and call the automata $\epsilon$NFAs. Two typical examples
of $\epsilon$NFAs are:


\begin{center}
\begin{tabular}[t]{c@{\hspace{9mm}}c}
\begin{tikzpicture}[>=stealth',very thick,
    every state/.style={minimum size=0pt,draw=blue!50,very thick,fill=blue!20},]
\node[state,initial]  (Q_0)  {$Q_0$};
\node[state] (Q_1) [above=of Q_0] {$Q_1$};
\node[state, accepting] (Q_2) [below=of Q_0] {$Q_2$};
\path[->] (Q_0) edge node [left]  {$\epsilon$} (Q_1);
\path[->] (Q_0) edge node [left]  {$\epsilon$} (Q_2);
\path[->] (Q_0) edge [loop right] node  {$a$} ();
\path[->] (Q_1) edge [loop right] node  {$a$} ();
\path[->] (Q_2) edge [loop right] node  {$b$} ();
\end{tikzpicture} &

\raisebox{20mm}{
\begin{tikzpicture}[>=stealth',very thick,
    every state/.style={minimum size=0pt,draw=blue!50,very thick,fill=blue!20},]
\node[state,initial]  (r_1)  {$R_1$};
\node[state] (r_2) [above=of r_1] {$R_2$};
\node[state, accepting] (r_3) [right=of r_1] {$R_3$};
\path[->] (r_1) edge node [below]  {$b$} (r_3);
\path[->] (r_2) edge [bend left] node [above]  {$a$} (r_3);
\path[->] (r_1) edge [bend left] node  [left] {$\epsilon$} (r_2);
\path[->] (r_2) edge [bend left] node  [right] {$a$} (r_1);
\end{tikzpicture}}
\end{tabular}
\end{center}

\noindent
Consider the $\epsilon$NFA on the left-hand side: the
$\epsilon$-transitions mean you do not have to ``consume'' any part of
the input string, but ``silently'' change to a different state. In
this example, if you are in the starting state $Q_0$, you can silently
move either to $Q_1$ or $Q_2$. You can see that once you are in $Q_1$,
respectively $Q_2$, you cannot ``go back'' to the other states. So it
seems allowing $\epsilon$-transitions is a rather substantial
extension to NFAs. On first appearances, $\epsilon$-transitions might
even look rather strange, or even silly. To start with, silent
transitions make the decision whether a string is accepted by an
automaton even harder: with $\epsilon$NFAs we have to look whether we
can do first some $\epsilon$-transitions and then do a
``proper''-transition; and after any ``proper''-transition we again
have to check whether we can do again some silent transitions. Even
worse, if there is a silent transition pointing back to the same
state, then we have to be careful our decision procedure for strings
does not loop (remember the depth-first search for exploring all
states).

The obvious question is: Do we get anything in return for this hassle
with silent transitions?  Well, we still have to work for it\ldots
unfortunately.  If we were to follow the many textbooks on the
subject, we would now start with defining what $\epsilon$NFAs
are---that would require extending the transition relation of
NFAs. Next, we would show that the $\epsilon$NFAs are equivalent to
NFAs and so on. Once we have done all this on paper, we would need to
implement $\epsilon$NFAs. Lets try to take a shortcut instead. We are
not really interested in $\epsilon$NFAs; they are only a convenient
tool for translating regular expressions into automata. So we are not
going to implementing them explicitly, but translate them immediately
into NFAs (in a sense $\epsilon$NFAs are just a convenient API for
lazy people ;o).  How does the translation work? Well we have to find
all transitions of the form

\[
q\stackrel{\epsilon}{\longrightarrow}\ldots\stackrel{\epsilon}{\longrightarrow}
\;\stackrel{a}{\longrightarrow}\;
\stackrel{\epsilon}{\longrightarrow}\ldots\stackrel{\epsilon}{\longrightarrow} q'
\]

\noindent where somewhere in the ``middle'' is an $a$-transition. We
replace them with $q \stackrel{a}{\longrightarrow} q'$. Doing this to
the $\epsilon$NFA on the right-hand side above gives the NFA

\begin{center}
\begin{tikzpicture}[>=stealth',very thick,
    every state/.style={minimum size=0pt,draw=blue!50,very thick,fill=blue!20},]
\node[state,initial]  (r_1)  {$R_1$};
\node[state] (r_2) [above=of r_1] {$R_2$};
\node[state, accepting] (r_3) [right=of r_1] {$R_3$};
\path[->] (r_1) edge node [above]  {$b$} (r_3);
\path[->] (r_2) edge [bend left] node [above]  {$a$} (r_3);
\path[->] (r_1) edge [bend left] node  [left] {$a$} (r_2);
\path[->] (r_2) edge [bend left] node  [right] {$a$} (r_1);
\path[->] (r_1) edge [loop below] node  {$a$} ();
\path[->] (r_1) edge [bend right] node [below]  {$a$} (r_3);
\end{tikzpicture}
\end{center}

\noindent where the single $\epsilon$-transition is replaced by
three additional $a$-transitions. Please do the calculations yourself
and verify that I did not forget any transition.

So in what follows, whenever we are given an $\epsilon$NFA we will
replace it by an equivalent NFA. The Scala code for this translation
is given in Figure~\ref{enfa}. The main workhorse in this code is a
function that calculates a fixpoint of function (Lines 5--10). This
function is used for ``discovering'' which states are reachable by
$\epsilon$-transitions. Once no new state is discovered, a fixpoint is
reached. This is used for example when calculating the starting states
of an equivalent NFA (see Line 36): we start with all starting states
of the $\epsilon$NFA and then look for all additional states that can
be reached by $\epsilon$-transitions. We keep on doing this until no
new state can be reached. This is what the $\epsilon$-closure, named
in the code \texttt{ecl}, calculates. Similarly, an accepting state of
the NFA is when we can reach an accepting state of the $\epsilon$NFA
by $\epsilon$-transitions.


\begin{figure}[p]
\small
\lstinputlisting[numbers=left]{../progs/display/enfa.scala}

\caption{A Scala function that translates $\epsilon$NFA into NFAs. The
  transition function of $\epsilon$NFA takes as input an \texttt{Option[C]}.
  \texttt{None} stands for an $\epsilon$-transition; \texttt{Some(c)}
  for a ``proper'' transition consuming a character. The functions in
  Lines 18--26 calculate
  all states reachable by one or more $\epsilon$-transition for a given
  set of states. The NFA is constructed in Lines 36--38.
  Note the interesting commands in Lines 5 and 6: their purpose is
  to ensure that \texttt{fixpT} is the tail-recursive version of
  the fixpoint construction; otherwise we would quickly get a
  stack-overflow exception, even on small examples, due to limitations
  of the JVM.
  \label{enfa}}
\end{figure}

Also look carefully how the transitions of $\epsilon$NFAs are
implemented. The additional possibility of performing silent
transitions is encoded by using \texttt{Option[C]} as the type for the
``input''. The \texttt{Some}s stand for ``proper'' transitions where
a character is consumed; \texttt{None} stands for
$\epsilon$-transitions. The transition functions for the two
$\epsilon$NFAs from the beginning of this section can be defined as

{\small\begin{lstlisting}[language=Scala]
val enfa_trans1 : (State, Option[Char]) :=> Set[State] = 
  { case (Q0, Some('a')) => Set(Q0)
    case (Q0, None) => Set(Q1, Q2)
    case (Q1, Some('a')) => Set(Q1)
    case (Q2, Some('b')) => Set(Q2) }

val enfa_trans2 : (State, Option[Char]) :=> Set[State] = 
  { case (R1, Some('b')) => Set(R3)
    case (R1, None) => Set(R2)
    case (R2, Some('a')) => Set(R1, R3) }
\end{lstlisting}}

\noindent
I hope you agree now with my earlier statement that the $\epsilon$NFAs
are just an API for NFAs.

\subsection*{Thompson Construction}

Having the translation of $\epsilon$NFAs to NFAs in place, we can
finally return to the problem of translating regular expressions into
equivalent NFAs. Recall that by equivalent we mean that the NFAs
recognise the same language. Consider the simple regular expressions
$\ZERO$, $\ONE$ and $c$. They can be translated into equivalent NFAs
as follows:

\begin{equation}\mbox{
\begin{tabular}[t]{l@{\hspace{10mm}}l}
\raisebox{1mm}{$\ZERO$} & 
\begin{tikzpicture}[scale=0.7,>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
\node[state, initial]  (Q_0)  {$\mbox{}$};
\end{tikzpicture}\\\\
\raisebox{1mm}{$\ONE$} & 
\begin{tikzpicture}[scale=0.7,>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
\node[state, initial, accepting]  (Q_0)  {$\mbox{}$};
\end{tikzpicture}\\\\
\raisebox{3mm}{$c$} & 
\begin{tikzpicture}[scale=0.7,>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
\node[state, initial]  (Q_0)  {$\mbox{}$};
\node[state, accepting]  (Q_1)  [right=of Q_0] {$\mbox{}$};
\path[->] (Q_0) edge node [below]  {$c$} (Q_1);
\end{tikzpicture}\\
\end{tabular}}\label{simplecases}
\end{equation}

\noindent
I let you think whether the NFAs can match exactly those strings the
regular expressions can match. To do this translation in code we need
a way to construct states ``programatically''...and as an additional
constraint Scala needs to recognise that these states are being distinct.
For this I implemented in Figure~\ref{thompson1} a class
\texttt{TState} that includes a counter and a companion object that
increases this counter whenever a new state is created.\footnote{You might
  have to read up what \emph{companion objects} do in Scala.}

\begin{figure}[p]
\small
\lstinputlisting[numbers=left]{../progs/display/thompson1.scala}
\caption{The first part of the Thompson Construction. Lines 7--16
  implement a way of how to create new states that are all
  distinct by virtue of a counter. This counter is
  increased in the companion object of \texttt{TState}
  whenever a new state is created. The code in Lines 24--40
  constructs NFAs for the simple regular expressions $\ZERO$, $\ONE$ and $c$.
  Compare this code with the pictures given in \eqref{simplecases} on
  Page~\pageref{simplecases}.
  \label{thompson1}}
\end{figure}

\begin{figure}[p]
\small
\lstinputlisting[numbers=left]{../progs/display/thompson2.scala}
\caption{The second part of the Thompson Construction implementing
  the composition of NFAs according to $\cdot$, $+$ and ${}^*$.
  The implicit class about rich partial functions
  implements the infix operation \texttt{+++} which
  combines an $\epsilon$NFA transition with a NFA transition
  (both are given as partial functions---but with different type!).\label{thompson2}}
\end{figure}

The case for the sequence regular expression $r_1 \cdot r_2$ is a bit more
complicated: Say, we are given by recursion two NFAs representing the regular
expressions $r_1$ and $r_2$ respectively.

\begin{center}
\begin{tikzpicture}[node distance=3mm,
    >=stealth',very thick,
    every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
\node[state, initial]  (Q_0)  {$\mbox{}$};
\node[state, initial]  (Q_01) [below=1mm of Q_0] {$\mbox{}$};  
\node[state, initial]  (Q_02) [above=1mm of Q_0] {$\mbox{}$};  
\node (R_1)  [right=of Q_0] {$\ldots$};
\node[state, accepting]  (T_1)  [right=of R_1] {$\mbox{}$};
\node[state, accepting]  (T_2)  [above=of T_1] {$\mbox{}$};
\node[state, accepting]  (T_3)  [below=of T_1] {$\mbox{}$};

\node (A_0)  [right=2.5cm of T_1] {$\mbox{}$};
\node[state, initial]  (A_01)  [above=1mm of A_0] {$\mbox{}$};
\node[state, initial]  (A_02)  [below=1mm of A_0] {$\mbox{}$};

\node (b_1)  [right=of A_0] {$\ldots$};
\node[state, accepting]  (c_1)  [right=of b_1] {$\mbox{}$};
\node[state, accepting]  (c_2)  [above=of c_1] {$\mbox{}$};
\node[state, accepting]  (c_3)  [below=of c_1] {$\mbox{}$};
\begin{pgfonlayer}{background}
  \node (1) [rounded corners, inner sep=1mm, thick,
    draw=black!60, fill=black!20, fit= (Q_0) (R_1) (T_1) (T_2) (T_3)] {};
  \node (2) [rounded corners, inner sep=1mm, thick,
    draw=black!60, fill=black!20, fit= (A_0) (b_1) (c_1) (c_2) (c_3)] {};
\node [yshift=2mm] at (1.north) {$r_1$};
\node [yshift=2mm] at (2.north) {$r_2$};
\end{pgfonlayer}
\end{tikzpicture}
\end{center}

\noindent The first NFA has some accepting states and the second some
starting states. We obtain an $\epsilon$NFA for $r_1\cdot r_2$ by
connecting the accepting states of the first NFA with
$\epsilon$-transitions to the starting states of the second
automaton. By doing so we make the accepting states of the first NFA
to be non-accepting like so:

\begin{center}
\begin{tikzpicture}[node distance=3mm,
    >=stealth',very thick,
    every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
\node[state, initial]  (Q_0)  {$\mbox{}$};
\node[state, initial]  (Q_01) [below=1mm of Q_0] {$\mbox{}$};  
\node[state, initial]  (Q_02) [above=1mm of Q_0] {$\mbox{}$};  
\node (r_1)  [right=of Q_0] {$\ldots$};
\node[state]  (t_1)  [right=of r_1] {$\mbox{}$};
\node[state]  (t_2)  [above=of t_1] {$\mbox{}$};
\node[state]  (t_3)  [below=of t_1] {$\mbox{}$};

\node  (A_0)  [right=2.5cm of t_1] {$\mbox{}$};
\node[state]  (A_01)  [above=1mm of A_0] {$\mbox{}$};
\node[state]  (A_02)  [below=1mm of A_0] {$\mbox{}$};

\node (b_1)  [right=of A_0] {$\ldots$};
\node[state, accepting]  (c_1)  [right=of b_1] {$\mbox{}$};
\node[state, accepting]  (c_2)  [above=of c_1] {$\mbox{}$};
\node[state, accepting]  (c_3)  [below=of c_1] {$\mbox{}$};
\path[->] (t_1) edge (A_01);
\path[->] (t_2) edge node [above]  {$\epsilon$s} (A_01);
\path[->] (t_3) edge (A_01);
\path[->] (t_1) edge (A_02);
\path[->] (t_2) edge (A_02);
\path[->] (t_3) edge node [below]  {$\epsilon$s} (A_02);
 
\begin{pgfonlayer}{background}
  \node (3) [rounded corners, inner sep=1mm, thick,
    draw=black!60, fill=black!20, fit= (Q_0) (c_1) (c_2) (c_3)] {};
\node [yshift=2mm] at (3.north) {$r_1\cdot r_2$};
\end{pgfonlayer}
\end{tikzpicture}
\end{center}

\noindent The idea behind this construction is that the start of any
string is first recognised by the first NFA, then we silently change
to the second NFA; the ending of the string is recognised by the
second NFA...just like matching of a string by the regular expression
$r_1\cdot r_2$. The Scala code for this construction is given in
Figure~\ref{thompson2} in Lines 16--23. The starting states of the
$\epsilon$NFA are the starting states of the first NFA (corresponding
to $r_1$); the accepting function is the accepting function of the
second NFA (corresponding to $r_2$). The new transition function is
all the ``old'' transitions plus the $\epsilon$-transitions connecting
the accepting states of the first NFA to the starting states of the
first NFA (Lines 18 and 19). The $\epsilon$NFA is then immediately
translated in a NFA.


The case for the alternative regular expression $r_1 + r_2$ is
slightly different: We are given by recursion two NFAs representing
$r_1$ and $r_2$ respectively. Each NFA has some starting states and
some accepting states. We obtain a NFA for the regular expression $r_1
+ r_2$ by composing the transition functions (this crucially depends
on knowing that the states of each component NFA are distinct---recall
we implemented for this to hold some bespoke code for states). We also
need to combine the starting states and accepting functions
appropriately.

\begin{center}
\begin{tabular}[t]{ccc}
\begin{tikzpicture}[node distance=3mm,
    >=stealth',very thick,
    every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},
    baseline=(current bounding box.center)]
\node at (0,0)  (1)  {$\mbox{}$};
\node  (2)  [above=10mm of 1] {};
\node[state, initial]  (4)  [above=1mm of 2] {$\mbox{}$};
\node[state, initial]  (5)  [below=1mm of 2] {$\mbox{}$};
\node[state, initial]  (3)  [below=10mm of 1] {$\mbox{}$};

\node (a)  [right=of 2] {$\ldots\,$};
\node  (a1)  [right=of a] {$$};
\node[state, accepting]  (a2)  [above=of a1] {$\mbox{}$};
\node[state, accepting]  (a3)  [below=of a1] {$\mbox{}$};

\node (b)  [right=of 3] {$\ldots$};
\node[state, accepting]  (b1)  [right=of b] {$\mbox{}$};
\node[state, accepting]  (b2)  [above=of b1] {$\mbox{}$};
\node[state, accepting]  (b3)  [below=of b1] {$\mbox{}$};
\begin{pgfonlayer}{background}
\node (1) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (2) (a1) (a2) (a3)] {};
\node (2) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (3) (b1) (b2) (b3)] {};
\node [yshift=3mm] at (1.north) {$r_1$};
\node [yshift=3mm] at (2.north) {$r_2$};
\end{pgfonlayer}
\end{tikzpicture}
&
\mbox{}\qquad\tikz{\draw[>=stealth,line width=2mm,->] (0,0) -- (1, 0)}\quad\mbox{}
&
\begin{tikzpicture}[node distance=3mm,
    >=stealth',very thick,
    every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},
    baseline=(current bounding box.center)]
\node at (0,0) (1)  {$\mbox{}$};
\node (2)  [above=10mm of 1] {$$};
\node[state, initial]  (4)  [above=1mm of 2] {$\mbox{}$};
\node[state, initial]  (5)  [below=1mm of 2] {$\mbox{}$};
\node[state, initial]  (3)  [below=10mm of 1] {$\mbox{}$};

\node (a)  [right=of 2] {$\ldots\,$};
\node (a1)  [right=of a] {$$};
\node[state, accepting]  (a2)  [above=of a1] {$\mbox{}$};
\node[state, accepting]  (a3)  [below=of a1] {$\mbox{}$};

\node (b)  [right=of 3] {$\ldots$};
\node[state, accepting]  (b1)  [right=of b] {$\mbox{}$};
\node[state, accepting]  (b2)  [above=of b1] {$\mbox{}$};
\node[state, accepting]  (b3)  [below=of b1] {$\mbox{}$};

%\path[->] (1) edge node [above]  {$\epsilon$} (2);
%\path[->] (1) edge node [below]  {$\epsilon$} (3);

\begin{pgfonlayer}{background}
\node (3) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (1) (a2) (a3) (b2) (b3)] {};
\node [yshift=3mm] at (3.north) {$r_1+ r_2$};
\end{pgfonlayer}
\end{tikzpicture}
\end{tabular}
\end{center}

\noindent The code for this construction is in Figure~\ref{thompson2}
in Lines 25--33.

Finally for the $*$-case we have a NFA for $r$ and connect its
accepting states to a new starting state via
$\epsilon$-transitions. This new starting state is also an accepting
state, because $r^*$ can recognise the empty string.

\begin{center}
\begin{tabular}[b]{@{}ccc@{}}  
\begin{tikzpicture}[node distance=3mm,
    >=stealth',very thick,
    every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},
    baseline=(current bounding box.north)]
\node (2)  {$\mbox{}$};
\node[state, initial]  (4)  [above=1mm of 2] {$\mbox{}$};
\node[state, initial]  (5)  [below=1mm of 2] {$\mbox{}$};  
\node (a)  [right=of 2] {$\ldots$};
\node[state, accepting]  (a1)  [right=of a] {$\mbox{}$};
\node[state, accepting]  (a2)  [above=of a1] {$\mbox{}$};
\node[state, accepting]  (a3)  [below=of a1] {$\mbox{}$};
\begin{pgfonlayer}{background}
\node (1) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (2) (a1) (a2) (a3)] {};
\node [yshift=3mm] at (1.north) {$r$};
\end{pgfonlayer}
\end{tikzpicture}
&
\raisebox{-16mm}{\;\tikz{\draw[>=stealth,line width=2mm,->] (0,0) -- (1, 0)}}
&
\begin{tikzpicture}[node distance=3mm,
    >=stealth',very thick,
    every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},
    baseline=(current bounding box.north)]
\node at (0,0) [state, initial,accepting]  (1)  {$\mbox{}$};
\node (2)  [right=16mm of 1] {$\mbox{}$};
\node[state]  (4)  [above=1mm of 2] {$\mbox{}$};
\node[state]  (5)  [below=1mm of 2] {$\mbox{}$};  
\node (a)  [right=of 2] {$\ldots$};
\node[state]  (a1)  [right=of a] {$\mbox{}$};
\node[state]  (a2)  [above=of a1] {$\mbox{}$};
\node[state]  (a3)  [below=of a1] {$\mbox{}$};
\path[->] (1) edge node [below]  {$\epsilon$} (4);
\path[->] (1) edge node [below]  {$\epsilon$} (5);
\path[->] (a1) edge [bend left=45] node [below]  {$\epsilon$} (1);
\path[->] (a2) edge [bend right] node [below]  {$\epsilon$} (1);
\path[->] (a3) edge [bend left=45] node [below]  {$\epsilon$} (1);
\begin{pgfonlayer}{background}
\node (2) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (1) (a2) (a3)] {};
\node [yshift=3mm] at (2.north) {$r^*$};
\end{pgfonlayer}
\end{tikzpicture}    
\end{tabular}
\end{center}

\noindent 
The corresponding code is in Figure~\ref{thompson2} in Lines 35--43)

To sum up, you can see in the sequence and star cases the need of
having silent $\epsilon$-transitions. Similarly the alternative case
shows the need of the NFA-nondeterminism. It seems awkward to form the
`alternative' composition of two DFAs, because DFA do not allow
several starting and successor states. All these constructions can now
be put together in the following recursive function:


{\small\begin{lstlisting}[language=Scala]
def thompson(r: Rexp) : NFAt = r match {
  case ZERO => NFA_ZERO()
  case ONE => NFA_ONE()
  case CHAR(c) => NFA_CHAR(c)  
  case ALT(r1, r2) => NFA_ALT(thompson(r1), thompson(r2))
  case SEQ(r1, r2) => NFA_SEQ(thompson(r1), thompson(r2))
  case STAR(r1) => NFA_STAR(thompson(r1))
}
\end{lstlisting}}

\noindent
It calculates a NFA from a regular expressions. At last we can run
NFAs for the our evil regular expression examples.  The graph on the
left shows that when translating a regular expression such as
$a^{\{n\}}$ into a NFA, the size can blow up and then even the
relative fast (on small examples) breadth-first search can be
slow. The graph on the right shows that with `evil' regular
expressions the depth-first search can be abysmally slow. Even if the
graphs not completely overlap with the curves of Python, Ruby and
Java, they are similar enough. OK\ldots now you know why regular
expression matchers in those languages are so slow. 


\begin{center}
\begin{tabular}{@{\hspace{-1mm}}c@{\hspace{1mm}}c@{}}  
\begin{tikzpicture}
\begin{axis}[
    title={Graph: $a^{?\{n\}} \cdot a^{\{n\}}$ and strings 
           $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
    title style={yshift=-2ex},
    xlabel={$n$},
    x label style={at={(1.05,0.0)}},
    ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5.5cm,
    height=4cm, 
    legend entries={Python,Ruby, breadth-first NFA},
    legend style={at={(0.5,-0.25)},anchor=north,font=\small},
    legend cell align=left]
  \addplot[blue,mark=*, mark options={fill=white}] table {re-python.data};
  \addplot[brown,mark=triangle*, mark options={fill=white}] table {re-ruby.data};
  % breath-first search in NFAs
  \addplot[red,mark=*, mark options={fill=white}] table {
    1 0.00586
    2 0.01209
    3 0.03076
    4 0.08269
    5 0.12881
    6 0.25146
    7 0.51377
    8 0.89079
    9 1.62802
    10 3.05326
    11 5.92437
    12 11.67863
    13 24.00568
  };
\end{axis}
\end{tikzpicture}
&
\begin{tikzpicture}
\begin{axis}[
    title={Graph: $(a^*)^* \cdot b$ and strings 
           $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
    title style={yshift=-2ex},
    xlabel={$n$},
    x label style={at={(1.05,0.0)}},
    ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5.5cm,
    height=4cm, 
    legend entries={Python, Java, depth-first NFA},
    legend style={at={(0.5,-0.25)},anchor=north,font=\small},
    legend cell align=left]
  \addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
  \addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};  
  % depth-first search in NFAs
  \addplot[red,mark=*, mark options={fill=white}] table {
    1 0.00605
    2 0.03086
    3 0.11994
    4 0.45389
    5 2.06192
    6 8.04894
    7 32.63549
  };
\end{axis}
\end{tikzpicture}
\end{tabular}
\end{center}



\subsection*{Subset Construction}

Of course, some developers of regular expression matchers are aware of
these problems with sluggish NFAs and try to address them. One common
technique for alleviating the problem I like to show you in this
section. This will also explain why we insisted on polymorphic types in
our DFA code (remember I used \texttt{A} and \texttt{C} for the types
of states and the input, see Figure~\ref{dfa} on
Page~\pageref{dfa}).\bigskip

\noindent
To start remember that we did not bother with defining and
implementing $\epsilon$NFAs: we immediately translated them into
equivalent NFAs. Equivalent in the sense of accepting the same
language (though we only claimed this and did not prove it
rigorously). Remember also that NFAs have non-deterministic
transitions defined as a relation or implemented as function returning
sets of states.  This non-determinism is crucial for the Thompson
Construction to work (recall the cases for $\cdot$, $+$ and
${}^*$). But this non-determinism makes it harder with NFAs to decide
when a string is accepted or not; whereas such a decision is rather
straightforward with DFAs: recall their transition function is a
\emph{function} that returns a single state. So with DFAs we do not
have to search at all.  What is perhaps interesting is the fact that
for every NFA we can find a DFA that also recognises the same
language. This might sound a bit paradoxical: NFA $\rightarrow$
decision of acceptance hard; DFA $\rightarrow$ decision easy. But this
\emph{is} true\ldots but of course there is always a caveat---nothing
ever is for free in life.

There are actually a number of methods for transforming a NFA into
an equivalent DFA, but the most famous one is the \emph{subset
  construction}. Consider the following NFA where the states are
labelled with $0$, $1$ and $2$.

\begin{center}
\begin{tabular}{c@{\hspace{10mm}}c}
\begin{tikzpicture}[scale=0.7,>=stealth',very thick,
                    every state/.style={minimum size=0pt,
                    draw=blue!50,very thick,fill=blue!20},
                    baseline=(current bounding box.center)]
\node[state,initial]  (Q_0)  {$0$};
\node[state] (Q_1) [below=of Q_0] {$1$};
\node[state, accepting] (Q_2) [below=of Q_1] {$2$};

\path[->] (Q_0) edge node [right]  {$b$} (Q_1);
\path[->] (Q_1) edge node [right]  {$a,b$} (Q_2);
\path[->] (Q_0) edge [loop above] node  {$a, b$} ();
\end{tikzpicture}
&
\begin{tabular}{r|ll}
states & $a$ & $b$\\
\hline
$\{\}\phantom{\star}$ & $\{\}$ & $\{\}$\\
start: $\{0\}\phantom{\star}$       & $\{0\}$   & $\{0,1\}$\\
$\{1\}\phantom{\star}$       & $\{2\}$       & $\{2\}$\\
$\{2\}\star$  & $\{\}$ & $\{\}$\\
$\{0,1\}\phantom{\star}$     & $\{0,2\}$   & $\{0,1,2\}$\\
$\{0,2\}\star$ & $\{0\}$   & $\{0,1\}$\\
$\{1,2\}\star$ & $\{2\}$       & $\{2\}$\\
$\{0,1,2\}\star$ & $\{0,2\}$ & $\{0,1,2\}$\\
\end{tabular}
\end{tabular}
\end{center}

\noindent The states of the corresponding DFA are given by generating
all subsets of the set $\{0,1,2\}$ (seen in the states column
in the table on the right). The other columns define the transition
function for the DFA for inputs $a$ and $b$. The first row states that
$\{\}$ is the sink state which has transitions for $a$ and $b$ to
itself.  The next three lines are calculated as follows:

\begin{itemize}
\item Suppose you calculate the entry for the $a$-transition for state
  $\{0\}$. Look for all states in the NFA that can be reached by such
  a transition from this state; this is only state $0$; therefore from
  state $\{0\}$ we can go to state $\{0\}$ via an $a$-transition.
\item Do the same for the $b$-transition; you can reach states $0$ and
  $1$ in the NFA; therefore in the DFA we can go from state $\{0\}$ to
  state $\{0,1\}$ via an $b$-transition.
\item Continue with the states $\{1\}$ and $\{2\}$.
\end{itemize}

\noindent
Once you filled in the transitions for `simple' states $\{0\}$
.. $\{2\}$, you only have to build the union for the compound states
$\{0,1\}$, $\{0,2\}$ and so on. For example for $\{0,1\}$ you take the
union of Line $\{0\}$ and Line $\{1\}$, which gives $\{0,2\}$ for $a$,
and $\{0,1,2\}$ for $b$. And so on.

The starting state of the DFA can be calculated from the starting
states of the NFA, that is in this case $\{0\}$. But in general there
can of course be many starting states in the NFA and you would take
the corresponding subset as \emph{the} starting state of the DFA.

The accepting states in the DFA are given by all sets that contain a
$2$, which is the only accpting state in this NFA. But again in
general if the subset contains any accepting state from the NFA, then
the corresponding state in the DFA is accepting as well.  This
completes the subset construction. The corresponding DFA for the NFA
shown above is:

\begin{equation}
\begin{tikzpicture}[scale=0.8,>=stealth',very thick,
                    every state/.style={minimum size=0pt,
                      draw=blue!50,very thick,fill=blue!20},
                    baseline=(current bounding box.center)]
\node[state,initial]  (q0)  {$0$};
\node[state] (q01) [right=of q0] {$0,1$};
\node[state,accepting] (q02) [below=of q01] {$0,2$};
\node[state,accepting] (q012) [right=of q02] {$0,1,2$};
\node[state] (q1) [below=0.5cm of q0] {$1$};
\node[state,accepting] (q2) [below=1cm of q1] {$2$};
\node[state] (qn) [below left=1cm of q2] {$\{\}$};
\node[state,accepting] (q12) [below right=1cm of q2] {$1,2$};

\path[->] (q0) edge node [above] {$b$} (q01);
\path[->] (q01) edge node [above] {$b$} (q012);
\path[->] (q0) edge [loop above] node  {$a$} ();
\path[->] (q012) edge [loop right] node  {$b$} ();
\path[->] (q012) edge node [below] {$a$} (q02);
\path[->] (q02) edge node [below] {$a$} (q0);
\path[->] (q01) edge [bend left] node [left]  {$a$} (q02);
\path[->] (q02) edge [bend left] node [right]  {$b$} (q01);
\path[->] (q1) edge node [left] {$a,b$} (q2);
\path[->] (q12) edge node [right] {$a, b$} (q2);
\path[->] (q2) edge node [right] {$a, b$} (qn);
\path[->] (qn) edge [loop left] node  {$a,b$} ();
\end{tikzpicture}\label{subsetdfa}
\end{equation}

\noindent
Please check that this is indeed a DFA. The big question is whether
this DFA can recognise the same language as the NFA we started with?
I let you ponder about this question.


There are also two points to note: One is that very often in the
subset construction the resulting DFA contains a number of ``dead''
states that are never reachable from the starting state. This is
obvious in the example, where state $\{1\}$, $\{2\}$, $\{1,2\}$ and
$\{\}$ can never be reached from the starting state. But this might
not always be as obvious as that. In effect the DFA in this example is
not a \emph{minimal} DFA (more about this in a minute). Such dead
states can be safely removed without changing the language that is
recognised by the DFA. Another point is that in some cases, however,
the subset construction produces a DFA that does \emph{not} contain
any dead states\ldots{}this means it calculates a minimal DFA. Which
in turn means that in some cases the number of states can by going
from NFAs to DFAs exponentially increase, namely by $2^n$ (which is
the number of subsets you can form for sets of $n$ states).  This blow
up in the number of states in the DFA is again bad news for how
quickly you can decide whether a string is accepted by a DFA or
not. So the caveat with DFAs is that they might make the task of
finding the next state trival, but might require $2^n$ times as many
states then a NFA.\bigskip

\noindent
To conclude this section, how conveniently we can
implement the subset construction with our versions of NFAs and
DFAs? Very conveninetly. The code is just:

{\small\begin{lstlisting}[language=Scala]
def subset[A, C](nfa: NFA[A, C]) : DFA[Set[A], C] = {
  DFA(nfa.starts, 
      { case (qs, c) => nfa.nexts(qs, c) }, 
      _.exists(nfa.fins))
}
\end{lstlisting}}  

\noindent
The interesting point in this code is that the state type of the
calculated DFA is \texttt{Set[A]}. Think carefully that this works out
correctly.

The DFA is then given by three components: the starting states, the
transition function and the accepting-states function.  The starting
states are a set in the given NFA, but a single state in the DFA.  The
transition function, given the state \texttt{qs} and input \texttt{c},
needs to produce the next state: this is the set of all NFA states
that are reachable from each state in \texttt{qs}. The function
\texttt{nexts} from the NFA class already calculates this for us. The
accepting-states function for the DFA is true henevner at least one
state in the subset is accepting (that is true) in the NFA.\medskip

\noindent
You might be able to spend some quality time tinkering with this code
and time to ponder about it. Then you will probably notice that it is
actually a bit silly. The whole point of translating the NFA into a
DFA via the subset construction is to make the decision of whether a
string is accepted or not faster. Given the code above, the generated
DFA will be exactly as fast, or as slow, as the NFA we started with
(actually it will even be a tiny bit slower). The reason is that we
just re-use the \texttt{nexts} function from the NFA. This function
implements the non-deterministic breadth-first search.  You might be
thinking: This is cheating! \ldots{} Well, not quite as you will see
later, but in terms of speed we still need to work a bit in order to
get sometimes(!) a faster DFA. Let's do this next.

\subsection*{DFA Minimisation}

As seen in \eqref{subsetdfa}, the subset construction from NFA to a
DFA can result in a rather ``inefficient'' DFA. Meaning there are
states that are not needed. There are two kinds of such unneeded
states: \emph{unreachable} states and \emph{non-distinguishable}
states. The first kind of states can just be removed without affecting
the language that can be recognised (after all they are
unreachable). The second kind can also be recognised and thus a DFA
can be \emph{minimised} by the following algorithm:

\begin{enumerate}
\item Take all pairs $(q, p)$ with $q \not= p$
\item Mark all pairs that accepting and non-accepting states
\item For all unmarked pairs $(q, p)$ and all characters $c$
      test whether 
      
      \begin{center} 
      $(\delta(q, c), \delta(p,c))$ 
      \end{center} 
      
      are marked. If there is one, then also mark $(q, p)$.
\item Repeat last step until no change.
\item All unmarked pairs can be merged.
\end{enumerate}

\noindent Unfortunately, once we throw away all unreachable states in
\eqref{subsetdfa}, all remaining states are needed.  In order to
illustrate the minimisation algorithm, consider the following DFA.

\begin{center}
\begin{tikzpicture}[>=stealth',very thick,auto,
                    every state/.style={minimum size=0pt,
                    inner sep=2pt,draw=blue!50,very thick,
                    fill=blue!20}]
\node[state,initial]  (Q_0)  {$Q_0$};
\node[state] (Q_1) [right=of Q_0] {$Q_1$};
\node[state] (Q_2) [below right=of Q_0] {$Q_2$};
\node[state] (Q_3) [right=of Q_2] {$Q_3$};
\node[state, accepting] (Q_4) [right=of Q_1] {$Q_4$};
\path[->] (Q_0) edge node [above]  {$a$} (Q_1);
\path[->] (Q_1) edge node [above]  {$a$} (Q_4);
\path[->] (Q_4) edge [loop right] node  {$a, b$} ();
\path[->] (Q_3) edge node [right]  {$a$} (Q_4);
\path[->] (Q_2) edge node [above]  {$a$} (Q_3);
\path[->] (Q_1) edge node [right]  {$b$} (Q_2);
\path[->] (Q_0) edge node [above]  {$b$} (Q_2);
\path[->] (Q_2) edge [loop left] node  {$b$} ();
\path[->] (Q_3) edge [bend left=95, looseness=1.3] node 
  [below]  {$b$} (Q_0);
\end{tikzpicture}
\end{center}

\noindent In Step 1 and 2 we consider essentially a triangle
of the form

\begin{center}
\begin{tikzpicture}[scale=0.6,line width=0.8mm]
\draw (0,0) -- (4,0);
\draw (0,1) -- (4,1);
\draw (0,2) -- (3,2);
\draw (0,3) -- (2,3);
\draw (0,4) -- (1,4);

\draw (0,0) -- (0, 4);
\draw (1,0) -- (1, 4);
\draw (2,0) -- (2, 3);
\draw (3,0) -- (3, 2);
\draw (4,0) -- (4, 1);

\draw (0.5,-0.5) node {$Q_0$}; 
\draw (1.5,-0.5) node {$Q_1$}; 
\draw (2.5,-0.5) node {$Q_2$}; 
\draw (3.5,-0.5) node {$Q_3$};
 
\draw (-0.5, 3.5) node {$Q_1$}; 
\draw (-0.5, 2.5) node {$Q_2$}; 
\draw (-0.5, 1.5) node {$Q_3$}; 
\draw (-0.5, 0.5) node {$Q_4$}; 

\draw (0.5,0.5) node {\large$\star$}; 
\draw (1.5,0.5) node {\large$\star$}; 
\draw (2.5,0.5) node {\large$\star$}; 
\draw (3.5,0.5) node {\large$\star$};
\end{tikzpicture}
\end{center}

\noindent where the lower row is filled with stars, because in
the corresponding pairs there is always one state that is
accepting ($Q_4$) and a state that is non-accepting (the other
states).

In Step 3 we need to fill in more stars according whether 
one of the next-state pairs are marked. We have to do this 
for every unmarked field until there is no change anymore.
This gives the triangle

\begin{center}
\begin{tikzpicture}[scale=0.6,line width=0.8mm]
\draw (0,0) -- (4,0);
\draw (0,1) -- (4,1);
\draw (0,2) -- (3,2);
\draw (0,3) -- (2,3);
\draw (0,4) -- (1,4);

\draw (0,0) -- (0, 4);
\draw (1,0) -- (1, 4);
\draw (2,0) -- (2, 3);
\draw (3,0) -- (3, 2);
\draw (4,0) -- (4, 1);

\draw (0.5,-0.5) node {$Q_0$}; 
\draw (1.5,-0.5) node {$Q_1$}; 
\draw (2.5,-0.5) node {$Q_2$}; 
\draw (3.5,-0.5) node {$Q_3$};
 
\draw (-0.5, 3.5) node {$Q_1$}; 
\draw (-0.5, 2.5) node {$Q_2$}; 
\draw (-0.5, 1.5) node {$Q_3$}; 
\draw (-0.5, 0.5) node {$Q_4$}; 

\draw (0.5,0.5) node {\large$\star$}; 
\draw (1.5,0.5) node {\large$\star$}; 
\draw (2.5,0.5) node {\large$\star$}; 
\draw (3.5,0.5) node {\large$\star$};
\draw (0.5,1.5) node {\large$\star$}; 
\draw (2.5,1.5) node {\large$\star$}; 
\draw (0.5,3.5) node {\large$\star$}; 
\draw (1.5,2.5) node {\large$\star$}; 
\end{tikzpicture}
\end{center}

\noindent which means states $Q_0$ and $Q_2$, as well as $Q_1$
and $Q_3$ can be merged. This gives the following minimal DFA

\begin{center}
\begin{tikzpicture}[>=stealth',very thick,auto,
                    every state/.style={minimum size=0pt,
                    inner sep=2pt,draw=blue!50,very thick,
                    fill=blue!20}]
\node[state,initial]  (Q_02)  {$Q_{0, 2}$};
\node[state] (Q_13) [right=of Q_02] {$Q_{1, 3}$};
\node[state, accepting] (Q_4) [right=of Q_13] 
  {$Q_{4\phantom{,0}}$};
\path[->] (Q_02) edge [bend left] node [above]  {$a$} (Q_13);
\path[->] (Q_13) edge [bend left] node [below]  {$b$} (Q_02);
\path[->] (Q_02) edge [loop below] node  {$b$} ();
\path[->] (Q_13) edge node [above]  {$a$} (Q_4);
\path[->] (Q_4) edge [loop above] node  {$a, b$} ();
\end{tikzpicture}
\end{center}


By the way, we are not bothering with implementing the above
minimisation algorith: while up to now all the transformations used
some clever composition of functions, the minimisation algorithm
cannot be implemented by just composing some functions. For this we
would require a more concrete representation of the transition
function (like maps). If we did this, however, then many advantages of
the functions would be thrown away. So the compromise is to not being
able to minimise (easily) our DFAs.

\subsection*{Brzozowski's Method}

I know this handout is already a long, long rant: but after all it is
a topic that has been researched for more than 60 years. If you
reflect on what you have read so far, the story is that you can take a
regular expression, translate it via the Thompson Construction into an
$\epsilon$NFA, then translate it into a NFA by removing all
$\epsilon$-transitions, and then via the subset construction obtain a
DFA. In all steps we made sure the language, or which strings can be
recognised, stays the same. Of couse we should have proved this in
each step, but let us cut corners here.  After the last section, we
can even minimise the DFA (maybe not in code). But again we made sure
the same language is recognised. You might be wondering: Can we go
into the other direction?  Can we go from a DFA and obtain a regular
expression that can recognise the same language as the DFA?\medskip

\noindent
The answer is yes. Again there are several methods for calculating a
regular expression for a DFA. I will show you Brzozowski's method
because it calculates a regular expression using quite familiar
transformations for solving equational systems. Consider the DFA:

\begin{center}
\begin{tikzpicture}[scale=1.5,>=stealth',very thick,auto,
                    every state/.style={minimum size=0pt,
                    inner sep=2pt,draw=blue!50,very thick,
                    fill=blue!20}]
  \node[state, initial]        (q0) at ( 0,1) {$Q_0$};
  \node[state]                    (q1) at ( 1,1) {$Q_1$};
  \node[state, accepting] (q2) at ( 2,1) {$Q_2$};
  \path[->] (q0) edge[bend left] node[above] {$a$} (q1)
            (q1) edge[bend left] node[above] {$b$} (q0)
            (q2) edge[bend left=50] node[below] {$b$} (q0)
            (q1) edge node[above] {$a$} (q2)
            (q2) edge [loop right] node {$a$} ()
            (q0) edge [loop below] node {$b$} ();
\end{tikzpicture}
\end{center}

\noindent for which we can set up the following equational
system

\begin{eqnarray}
Q_0 & = & \ONE + Q_0\,b + Q_1\,b +  Q_2\,b\\
Q_1 & = & Q_0\,a\\
Q_2 & = & Q_1\,a + Q_2\,a
\end{eqnarray}

\noindent There is an equation for each node in the DFA. Let
us have a look how the right-hand sides of the equations are
constructed. First have a look at the second equation: the
left-hand side is $Q_1$ and the right-hand side $Q_0\,a$. The
right-hand side is essentially all possible ways how to end up
in node $Q_1$. There is only one incoming edge from $Q_0$ consuming
an $a$.  Therefore the right hand side is this
state followed by character---in this case $Q_0\,a$. Now lets
have a look at the third equation: there are two incoming
edges for $Q_2$. Therefore we have two terms, namely $Q_1\,a$ and
$Q_2\,a$. These terms are separated by $+$. The first states
that if in state $Q_1$ consuming an $a$ will bring you to
$Q_2$, and the second that being in $Q_2$ and consuming an $a$
will make you stay in $Q_2$. The right-hand side of the
first equation is constructed similarly: there are three 
incoming edges, therefore there are three terms. There is
one exception in that we also ``add'' $\ONE$ to the
first equation, because it corresponds to the starting state
in the DFA.

Having constructed the equational system, the question is
how to solve it? Remarkably the rules are very similar to
solving usual linear equational systems. For example the
second equation does not contain the variable $Q_1$ on the
right-hand side of the equation. We can therefore eliminate 
$Q_1$ from the system by just substituting this equation
into the other two. This gives

\begin{eqnarray}
Q_0 & = & \ONE + Q_0\,b + Q_0\,a\,b +  Q_2\,b\\
Q_2 & = & Q_0\,a\,a + Q_2\,a
\end{eqnarray}

\noindent where in Equation (4) we have two occurrences
of $Q_0$. Like the laws about $+$ and $\cdot$, we can simplify 
Equation (4) to obtain the following two equations:

\begin{eqnarray}
Q_0 & = & \ONE + Q_0\,(b + a\,b) +  Q_2\,b\\
Q_2 & = & Q_0\,a\,a + Q_2\,a
\end{eqnarray}
 
\noindent Unfortunately we cannot make any more progress with
substituting equations, because both (6) and (7) contain the
variable on the left-hand side also on the right-hand side.
Here we need to now use a law that is different from the usual
laws about linear equations. It is called \emph{Arden's rule}.
It states that if an equation is of the form $q = q\,r + s$
then it can be transformed to $q = s\, r^*$. Since we can
assume $+$ is symmetric, Equation (7) is of that form: $s$ is
$Q_0\,a\,a$ and $r$ is $a$. That means we can transform
(7) to obtain the two new equations

\begin{eqnarray}
Q_0 & = & \ONE + Q_0\,(b + a\,b) +  Q_2\,b\\
Q_2 & = & Q_0\,a\,a\,(a^*)
\end{eqnarray}

\noindent Now again we can substitute the second equation into
the first in order to eliminate the variable $Q_2$.

\begin{eqnarray}
Q_0 & = & \ONE + Q_0\,(b + a\,b) +  Q_0\,a\,a\,(a^*)\,b
\end{eqnarray}

\noindent Pulling $Q_0$ out as a single factor gives:

\begin{eqnarray}
Q_0 & = & \ONE + Q_0\,(b + a\,b + a\,a\,(a^*)\,b)
\end{eqnarray}

\noindent This equation is again of the form so that we can
apply Arden's rule ($r$ is $b + a\,b + a\,a\,(a^*)\,b$ and $s$
is $\ONE$). This gives as solution for $Q_0$ the following
regular expression:

\begin{eqnarray}
Q_0 & = & \ONE\,(b + a\,b + a\,a\,(a^*)\,b)^*
\end{eqnarray}

\noindent Since this is a regular expression, we can simplify
away the $\ONE$ to obtain the slightly simpler regular
expression

\begin{eqnarray}
Q_0 & = & (b + a\,b + a\,a\,(a^*)\,b)^*
\end{eqnarray}

\noindent 
Now we can unwind this process and obtain the solutions
for the other equations. This gives:

\begin{eqnarray}
Q_0 & = & (b + a\,b + a\,a\,(a^*)\,b)^*\\
Q_1 & = & (b + a\,b + a\,a\,(a^*)\,b)^*\,a\\
Q_2 & = & (b + a\,b + a\,a\,(a^*)\,b)^*\,a\,a\,(a)^*
\end{eqnarray}

\noindent Finally, we only need to ``add'' up the equations
which correspond to a terminal state. In our running example,
this is just $Q_2$. Consequently, a regular expression
that recognises the same language as the DFA is

\[
(b + a\,b + a\,a\,(a^*)\,b)^*\,a\,a\,(a)^*
\]

\noindent You can somewhat crosscheck your solution by taking a string
the regular expression can match and and see whether it can be matched
by the DFA.  One string for example is $aaa$ and \emph{voila} this
string is also matched by the automaton.

We should prove that Brzozowski's method really produces an equivalent
regular expression. But for the purposes of this module, we omit
this. I guess you are relieved.


\subsection*{Regular Languages}

Given the constructions in the previous sections we obtain 
the following overall picture:

\begin{center}
\begin{tikzpicture}
\node (rexp)  {\bf Regexps};
\node (nfa) [right=of rexp] {\bf NFAs};
\node (dfa) [right=of nfa] {\bf DFAs};
\node (mdfa) [right=of dfa] {\bf\begin{tabular}{c}minimal\\ DFAs\end{tabular}};
\path[->,line width=1mm] (rexp) edge node [above=4mm, black] {\begin{tabular}{c@{\hspace{9mm}}}Thompson's\\[-1mm] construction\end{tabular}} (nfa);
\path[->,line width=1mm] (nfa) edge node [above=4mm, black] {\begin{tabular}{c}subset\\[-1mm] construction\end{tabular}}(dfa);
\path[->,line width=1mm] (dfa) edge node [below=5mm, black] {minimisation} (mdfa);
\path[->,line width=1mm] (dfa) edge [bend left=45] node [below] {\begin{tabular}{l}Brzozowski's\\ method\end{tabular}} (rexp);
\end{tikzpicture}
\end{center}

\noindent By going from regular expressions over NFAs to DFAs,
we can always ensure that for every regular expression there
exists a NFA and a DFA that can recognise the same language.
Although we did not prove this fact. Similarly by going from
DFAs to regular expressions, we can make sure for every DFA 
there exists a regular expression that can recognise the same
language. Again we did not prove this fact. 

The fundamental conclusion we can draw is that automata and regular
expressions can recognise the same set of languages:

\begin{quote} A language is \emph{regular} iff there exists a
regular expression that recognises all its strings.
\end{quote}

\noindent or equivalently 

\begin{quote} A language is \emph{regular} iff there exists an
automaton that recognises all its strings.
\end{quote}

\noindent Note that this is not a stement for a particular language
(that is a particular set of strings), but about a large class of
languages, namely the regular ones.

As a consequence for deciding whether a string is recognised by a
regular expression, we could use our algorithm based on derivatives or
NFAs or DFAs. But let us quickly look at what the differences mean in
computational terms. Translating a regular expression into a NFA gives
us an automaton that has $O(n)$ states---that means the size of the
NFA grows linearly with the size of the regular expression. The
problem with NFAs is that the problem of deciding whether a string is
accepted or not is computationally not cheap. Remember with NFAs we
have potentially many next states even for the same input and also
have the silent $\epsilon$-transitions. If we want to find a path from
the starting state of a NFA to an accepting state, we need to consider
all possibilities. In Ruby, Python and Java this is done by a
depth-first search, which in turn means that if a ``wrong'' choice is
made, the algorithm has to backtrack and thus explore all potential
candidates. This is exactly the reason why Ruby, Python and Java are
so slow for evil regular expressions. An alternative to the
potentially slow depth-first search is to explore the search space in
a breadth-first fashion, but this might incur a big memory penalty.

To avoid the problems with NFAs, we can translate them into DFAs. With
DFAs the problem of deciding whether a string is recognised or not is
much simpler, because in each state it is completely determined what
the next state will be for a given input. So no search is needed.  The
problem with this is that the translation to DFAs can explode
exponentially the number of states. Therefore when this route is
taken, we definitely need to minimise the resulting DFAs in order to
have an acceptable memory and runtime behaviour. But remember the
subset construction in the worst case explodes the number of states by
$2^n$.  Effectively also the translation to DFAs can incur a big
runtime penalty.

But this does not mean that everything is bad with automata.  Recall
the problem of finding a regular expressions for the language that is
\emph{not} recognised by a regular expression. In our implementation
we added explicitly such a regular expressions because they are useful
for recognising comments. But in principle we did not need to. The
argument for this is as follows: take a regular expression, translate
it into a NFA and then a DFA that both recognise the same
language. Once you have the DFA it is very easy to construct the
automaton for the language not recognised by a DFA. If the DFA is
completed (this is important!), then you just need to exchange the
accepting and non-accepting states. You can then translate this DFA
back into a regular expression and that will be the regular expression
that can match all strings the original regular expression could
\emph{not} match.

It is also interesting that not all languages are regular. The most
well-known example of a language that is not regular consists of all
the strings of the form

\[a^n\,b^n\]

\noindent meaning strings that have the same number of $a$s and
$b$s. You can try, but you cannot find a regular expression for this
language and also not an automaton. One can actually prove that there
is no regular expression nor automaton for this language, but again
that would lead us too far afield for what we want to do in this
module.


\subsection*{Where Have Derivatives Gone?}

Still to be done\bigskip

%\noindent
%By now you are probably fed up with this text. It is now way too long
%for one lecture, but there is still one aspect of the
%automata-regular-expression-connection I like to describe. Perhaps by
%now you are asking yourself: Where have the derivatives gone? Did we
%just forget them?  Well, they have a place in the picture of
%calculating a DFA from the regular expression.

%To be done
%
%\begin{center}
%\begin{tikzpicture}
%  [level distance=25mm,very thick,auto,
%   level 1/.style={sibling distance=30mm},
%   level 2/.style={sibling distance=15mm},
%   every node/.style={minimum size=30pt,
%                    inner sep=0pt,circle,draw=blue!50,very thick,
%                    fill=blue!20}]
%  \node {$r$} [grow=right]
%  child[->] {node (cn) {$d_{c_n}$}
%    child { node {$dd_{c_nc_n}$}}
%    child { node {$dd_{c_nc_1}$}}
%    %edge from parent node[left] {$c_n$}
%  }
%  child[->] {node (c1) {$d_{c_1}$}
%    child { node {$dd_{c_1c_n}$}}
%    child { node {$dd_{c_1c_1}$}}
%    %edge from parent node[left] {$c_1$}
%  };
%  %%\draw (cn) -- (c1) node {\vdots}; 
%\end{tikzpicture}  
%\end{center}

%\section*{Further Reading}

%Compare what a ``human expert'' would create as an automaton for the
%regular expression $a\cdot (b + c)^*$ and what the Thomson
%algorithm generates.

%http://www.inf.ed.ac.uk/teaching/courses/ct/
\end{document}

%%% Local Variables: 
%%% mode: latex
%%% TeX-master: t
%%% End: