\documentclass{article}
\usepackage{../style}
\usepackage{../langs}
\usepackage{../graphics}
\begin{document}
\fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017}
\section*{Handout 3 (Finite Automata)}
Every formal language and compiler course I know of bombards you first
with automata and then to a much, much smaller extend with regular
expressions. As you can see, this course is turned upside down:
regular expressions come first. The reason is that regular expressions
are easier to reason about and the notion of derivatives, although
already quite old, only became more widely known rather
recently. Still, let us in this lecture have a closer look at automata
and their relation to regular expressions. This will help us with
understanding why the regular expression matchers in Python, Ruby and
Java are so slow with certain regular expressions. On the way we will
also see what are the limitations of regular
expressions. Unfortunately, they cannot be used for \emph{everything}.
\subsection*{Deterministic Finite Automata}
Lets start\ldots the central definition is:\medskip
\noindent
A \emph{deterministic finite automaton} (DFA), say $A$, is
given by a five-tuple written ${\cal A}(\varSigma, Qs, Q_0, F, \delta)$ where
\begin{itemize}
\item $\varSigma$ is an alphabet,
\item $Qs$ is a finite set of states,
\item $Q_0 \in Qs$ is the start state,
\item $F \subseteq Qs$ are the accepting states, and
\item $\delta$ is the transition function.
\end{itemize}
\noindent I am sure you have seen this definition already
before. The transition function determines how to ``transition'' from
one state to the next state with respect to a character. We have the
assumption that these transition functions do not need to be defined
everywhere: so it can be the case that given a character there is no
next state, in which case we need to raise a kind of ``failure
exception''. That means actually we have \emph{partial} functions as
transitions---see the Scala implementation of DFAs later on. A
typical example of a DFA is
\begin{center}
\begin{tikzpicture}[>=stealth',very thick,auto,
every state/.style={minimum size=0pt,
inner sep=2pt,draw=blue!50,very thick,
fill=blue!20},scale=2]
\node[state,initial] (Q_0) {$Q_0$};
\node[state] (Q_1) [right=of Q_0] {$Q_1$};
\node[state] (Q_2) [below right=of Q_0] {$Q_2$};
\node[state] (Q_3) [right=of Q_2] {$Q_3$};
\node[state, accepting] (Q_4) [right=of Q_1] {$Q_4$};
\path[->] (Q_0) edge node [above] {$a$} (Q_1);
\path[->] (Q_1) edge node [above] {$a$} (Q_4);
\path[->] (Q_4) edge [loop right] node {$a, b$} ();
\path[->] (Q_3) edge node [right] {$a$} (Q_4);
\path[->] (Q_2) edge node [above] {$a$} (Q_3);
\path[->] (Q_1) edge node [right] {$b$} (Q_2);
\path[->] (Q_0) edge node [above] {$b$} (Q_2);
\path[->] (Q_2) edge [loop left] node {$b$} ();
\path[->] (Q_3) edge [bend left=95, looseness=1.3] node [below] {$b$} (Q_0);
\end{tikzpicture}
\end{center}
\noindent In this graphical notation, the accepting state $Q_4$ is
indicated with double circles. Note that there can be more than one
accepting state. It is also possible that a DFA has no accepting
state at all, or that the starting state is also an accepting
state. In the case above the transition function is defined everywhere
and can also be given as a table as follows:
\[
\begin{array}{lcl}
(Q_0, a) &\rightarrow& Q_1\\
(Q_0, b) &\rightarrow& Q_2\\
(Q_1, a) &\rightarrow& Q_4\\
(Q_1, b) &\rightarrow& Q_2\\
(Q_2, a) &\rightarrow& Q_3\\
(Q_2, b) &\rightarrow& Q_2\\
(Q_3, a) &\rightarrow& Q_4\\
(Q_3, b) &\rightarrow& Q_0\\
(Q_4, a) &\rightarrow& Q_4\\
(Q_4, b) &\rightarrow& Q_4\\
\end{array}
\]
We need to define the notion of what language is accepted by
an automaton. For this we lift the transition function
$\delta$ from characters to strings as follows:
\[
\begin{array}{lcl}
\widehat{\delta}(q, []) & \dn & q\\
\widehat{\delta}(q, c\!::\!s) & \dn & \widehat{\delta}(\delta(q, c), s)\\
\end{array}
\]
\noindent This lifted transition function is often called
\emph{delta-hat}. Given a string, we start in the starting state and
take the first character of the string, follow to the next state, then
take the second character and so on. Once the string is exhausted and
we end up in an accepting state, then this string is accepted by the
automaton. Otherwise it is not accepted. This also means that if along
the way we hit the case where the transition function $\delta$ is not
defined, we need to raise an error. In our implementation we will deal
with this case elegantly by using Scala's \texttt{Try}. Summing up: a
string $s$ is in the \emph{language accepted by the automaton} ${\cal
A}(\varSigma, Q, Q_0, F, \delta)$ iff
\[
\widehat{\delta}(Q_0, s) \in F
\]
\noindent I let you think about a definition that describes the set of
all strings accepted by a deterministic finite automaton.
\begin{figure}[p]
\small
\lstinputlisting[numbers=left]{../progs/display/dfa.scala}
\caption{A Scala implementation of DFAs using partial functions.
Note some subtleties: \texttt{deltas} implements the delta-hat
construction by lifting the (partial) transition function to lists
of characters. Since \texttt{delta} is given as a partial function,
it can obviously go ``wrong'' in which case the \texttt{Try} in
\texttt{accepts} catches the error and returns \texttt{false}---that
means the string is not accepted. The example \texttt{delta} in
Line 28--38 implements the DFA example shown earlier in the
handout.\label{dfa}}
\end{figure}
My take on an implementation of DFAs in Scala is given in
Figure~\ref{dfa}. As you can see, there are many features of the
mathematical definition that are quite closely reflected in the
code. In the DFA-class, there is a starting state, called
\texttt{start}, with the polymorphic type \texttt{A}. There is a
partial function \texttt{delta} for specifying the transitions---these
partial functions take a state (of polymorphic type \texttt{A}) and an
input (of polymorphic type \texttt{C}) and produce a new state (of
type \texttt{A}). For the moment it is OK to assume that \texttt{A} is
some arbitrary type for states and the input is just characters. (The
reason for not having concrete types, but polymorphic types for the
states and the input of DFAs will become clearer later on.)
The DFA-class has also an argument for specifying final states. In the
implementation it is not a set of states, as in the mathematical
definition, but a function from states to booleans (this function is
supposed to return true whenever a state is final; false
otherwise). While this boolean function is different from the sets of
states, Scala allows to use sets for such functions (see Line 40 where
the DFA is initialised). Again it will become clear later on why I use
functions for final states, rather than sets.
The most important point in the implementation is that I use Scala's
partial functions for representing the transitions; alternatives would
have been \texttt{Maps} or even \texttt{Lists}. One of the main
advantages of using partial functions is that transitions can be quite
nicely defined by a series of \texttt{case} statements (see Lines 28
-- 38 for an example). If you need to represent an automaton with a
sink state (catch-all-state), you can use Scala's pattern matching and
write something like
{\small\begin{lstlisting}[language=Scala]
abstract class State
...
case object Sink extends State
val delta : (State, Char) :=> State =
{ case (S0, 'a') => S1
case (S1, 'a') => S2
case _ => Sink
}
\end{lstlisting}}
\noindent I let you think what the corresponding DFA looks like in the
graph notation.
Finally, I let you ponder whether this is a good implementation of
DFAs or not. In doing so I hope you notice that the $\varSigma$ and
$Qs$ components (the alphabet and the set of finite states,
respectively) are missing from the class definition. This means that
the implementation allows you to do some fishy things you are not
meant to do with DFAs. Which fishy things could that be?
\subsection*{Non-Deterministic Finite Automata}
Remember we want to find out what the relation is between regular
expressions and automata. To do this with DFAs is a bit unwieldy.
While with DFAs it is always clear that given a state and a character
what the next state is (potentially none), it will be convenient to
relax this restriction. That means we allow states to have several
potential successor states. We even allow more than one starting
state. The resulting construction is called a \emph{Non-Deterministic
Finite Automaton} (NFA) given also as a five-tuple ${\cal
A}(\varSigma, Qs, Q_{0s}, F, \rho)$ where
\begin{itemize}
\item $\varSigma$ is an alphabet,
\item $Qs$ is a finite set of states
\item $Q_{0s}$ is a set of start states ($Q_{0s} \subseteq Qs$)
\item $F$ are some accepting states with $F \subseteq Qs$, and
\item $\rho$ is a transition relation.
\end{itemize}
\noindent
A typical example of a NFA is
% A NFA for (ab* + b)*a
\begin{center}
\begin{tikzpicture}[>=stealth',very thick, auto,
every state/.style={minimum size=0pt,inner sep=3pt,
draw=blue!50,very thick,fill=blue!20},scale=2]
\node[state,initial] (Q_0) {$Q_0$};
\node[state] (Q_1) [right=of Q_0] {$Q_1$};
\node[state, accepting] (Q_2) [right=of Q_1] {$Q_2$};
\path[->] (Q_0) edge [loop above] node {$b$} ();
\path[<-] (Q_0) edge node [below] {$b$} (Q_1);
\path[->] (Q_0) edge [bend left] node [above] {$a$} (Q_1);
\path[->] (Q_0) edge [bend right] node [below] {$a$} (Q_2);
\path[->] (Q_1) edge [loop above] node {$a,b$} ();
\path[->] (Q_1) edge node [above] {$a$} (Q_2);
\end{tikzpicture}
\end{center}
\noindent
This NFA happens to have only one starting state, but in general there
could be more. Notice that in state $Q_0$ we might go to state $Q_1$
\emph{or} to state $Q_2$ when receiving an $a$. Similarly in state
$Q_1$ and receiving an $a$, we can stay in state $Q_1$ \emph{or} go to
$Q_2$. This kind of choice is not allowed with DFAs. The downside of
this choice in NFAs is that when it comes to deciding whether a string is
accepted by a NFA we potentially have to explore all possibilities. I
let you think which strings the above NFA accepts.
There are a number of additional points you should note about
NFAs. Every DFA is a NFA, but not vice versa. The $\rho$ in NFAs is a
transition \emph{relation} (DFAs have a transition function). The
difference between a function and a relation is that a function has
always a single output, while a relation gives, roughly speaking,
several outputs. Look again at the NFA above: if you are currently in
the state $Q_1$ and you read a character $b$, then you can transition
to either $Q_0$ \emph{or} $Q_2$. Which route, or output, you take is
not determined. This non-determinism can be represented by a
relation.
My implementation of NFAs in Scala is shown in Figure~\ref{nfa}.
Perhaps interestingly, I do not actually use relations for my NFAs,
but use transition functions that return sets of states. DFAs have
partial transition functions that return a single state; my NFAs
return a set of states. I let you think about this representation for
NFA-transitions and how it corresponds to the relations used in the
mathematical definition of NFAs. An example of a transition function
in Scala for the NFA shown above is
{\small\begin{lstlisting}[language=Scala]
val nfa_delta : (State, Char) :=> Set[State] =
{ case (Q0, 'a') => Set(Q1, Q2)
case (Q0, 'b') => Set(Q0)
case (Q1, 'a') => Set(Q1, Q2)
case (Q1, 'b') => Set(Q0, Q1) }
\end{lstlisting}}
Like in the mathematical definition, \texttt{starts} is in
NFAs a set of states; \texttt{fins} is again a function from states to
booleans. The \texttt{next} function calculates the set of next states
reachable from a single state \texttt{q} by a character~\texttt{c}. In
case there is no such state---the partial transition function is
undefined---the empty set is returned (see function
\texttt{applyOrElse} in Lines 9 and 10). The function \texttt{nexts}
just lifts this function to sets of states.
\begin{figure}[p]
\small
\lstinputlisting[numbers=left]{../progs/display/nfa.scala}
\caption{A Scala implementation of NFAs using partial functions.
Notice that the function \texttt{accepts} implements the
acceptance of a string in a breath-first search fashion. This can be a costly
way of deciding whether a string is accepted or not in applications that need to handle
large NFAs and large inputs.\label{nfa}}
\end{figure}
Look very careful at the \texttt{accepts} and \texttt{deltas}
functions in NFAs and remember that when accepting a string by a NFA
we might have to explore all possible transitions (recall which state
to go to is not unique anymore with NFAs\ldots{}we need to explore
potentially all next states). The implementation achieves this
exploration through a \emph{breadth-first search}. This is fine for
small NFAs, but can lead to real memory problems when the NFAs are
bigger and larger strings need to be processed. As result, some
regular expression matching engines resort to a \emph{depth-first
search} with \emph{backtracking} in unsuccessful cases. In our
implementation we can implement a depth-first version of
\texttt{accepts} using Scala's \texttt{exists}-function as follows:
{\small\begin{lstlisting}[language=Scala]
def search(q: A, s: List[C]) : Boolean = s match {
case Nil => fins(q)
case c::cs => next(q, c).exists(search(_, cs))
}
def accepts2(s: List[C]) : Boolean =
starts.exists(search(_, s))
\end{lstlisting}}
\noindent
This depth-first way of exploration seems to work quite efficiently in
many examples and is much less of a strain on memory. The problem is
that the backtracking can get ``catastrophic'' in some
examples---remember the catastrophic backtracking from earlier
lectures. This depth-first search with backtracking is the reason for
the abysmal performance of some regular expression matchings in Java,
Ruby and Python. I like to show you this in the next two sections.
\subsection*{Epsilon NFAs}
In order to get an idea what calculations are performed by Java \&
friends, we need a method for transforming a regular expression into
an automaton. This automaton should accept exactly those strings that
are accepted by the regular expression. The simplest and most
well-known method for this is called \emph{Thompson Construction},
after the Turing Award winner Ken Thompson. This method is by
recursion over regular expressions and depends on the non-determinism
in NFAs described in the previous section. You will see shortly why
this construction works well with NFAs, but is not so straightforward
with DFAs.
Unfortunately we are still one step away from our intended target
though---because this construction uses a version of NFAs that allows
``silent transitions''. The idea behind silent transitions is that
they allow us to go from one state to the next without having to
consume a character. We label such silent transition with the letter
$\epsilon$ and call the automata $\epsilon$NFAs. Two typical examples
of $\epsilon$NFAs are:
\begin{center}
\begin{tabular}[t]{c@{\hspace{9mm}}c}
\begin{tikzpicture}[>=stealth',very thick,
every state/.style={minimum size=0pt,draw=blue!50,very thick,fill=blue!20},]
\node[state,initial] (Q_0) {$Q_0$};
\node[state] (Q_1) [above=of Q_0] {$Q_1$};
\node[state, accepting] (Q_2) [below=of Q_0] {$Q_2$};
\path[->] (Q_0) edge node [left] {$\epsilon$} (Q_1);
\path[->] (Q_0) edge node [left] {$\epsilon$} (Q_2);
\path[->] (Q_0) edge [loop right] node {$a$} ();
\path[->] (Q_1) edge [loop right] node {$a$} ();
\path[->] (Q_2) edge [loop right] node {$b$} ();
\end{tikzpicture} &
\raisebox{20mm}{
\begin{tikzpicture}[>=stealth',very thick,
every state/.style={minimum size=0pt,draw=blue!50,very thick,fill=blue!20},]
\node[state,initial] (r_1) {$R_1$};
\node[state] (r_2) [above=of r_1] {$R_2$};
\node[state, accepting] (r_3) [right=of r_1] {$R_3$};
\path[->] (r_1) edge node [below] {$b$} (r_3);
\path[->] (r_2) edge [bend left] node [above] {$a$} (r_3);
\path[->] (r_1) edge [bend left] node [left] {$\epsilon$} (r_2);
\path[->] (r_2) edge [bend left] node [right] {$a$} (r_1);
\end{tikzpicture}}
\end{tabular}
\end{center}
\noindent
Consider the $\epsilon$NFA on the left-hand side: the
$\epsilon$-transitions mean you do not have to ``consume'' any part of
the input string, but ``silently'' change to a different state. In
this example, if you are in the starting state $Q_0$, you can silently
move either to $Q_1$ or $Q_2$. You can see that once you are in $Q_1$,
respectively $Q_2$, you cannot ``go back'' to the other states. So it
seems allowing $\epsilon$-transitions is a rather substantial
extension to NFAs. On first appearances, $\epsilon$-transitions might
even look rather strange, or even silly. To start with, silent
transitions make the decision whether a string is accepted by an
automaton even harder: with $\epsilon$NFAs we have to look whether we
can do first some $\epsilon$-transitions and then do a
``proper''-transition; and after any ``proper''-transition we again
have to check whether we can do again some silent transitions. Even
worse, if there is a silent transition pointing back to the same
state, then we have to be careful our decision procedure for strings
does not loop (remember the depth-first search for exploring all
states).
The obvious question is: Do we get anything in return for this hassle
with silent transitions? Well, we still have to work for it\ldots
unfortunately. If we were to follow the many textbooks on the
subject, we would now start with defining what $\epsilon$NFAs
are---that would require extending the transition relation of
NFAs. Next, we would show that the $\epsilon$NFAs are equivalent to
NFAs and so on. Once we have done all this on paper, we would need to
implement $\epsilon$NFAs. Lets try to take a shortcut instead. We are
not really interested in $\epsilon$NFAs; they are only a convenient
tool for translating regular expressions into automata. So we are not
going to implementing them explicitly, but translate them immediately
into NFAs (in a sense $\epsilon$NFAs are just a convenient API for
lazy people ;o). How does the translation work? Well we have to find
all transitions of the form
\[
q\stackrel{\epsilon}{\longrightarrow}\ldots\stackrel{\epsilon}{\longrightarrow}
\;\stackrel{a}{\longrightarrow}\;
\stackrel{\epsilon}{\longrightarrow}\ldots\stackrel{\epsilon}{\longrightarrow} q'
\]
\noindent and replace them with $q \stackrel{a}{\longrightarrow}
q'$. Doing this to the $\epsilon$NFA on the right-hand side above gives
the NFA
\begin{center}
\begin{tikzpicture}[>=stealth',very thick,
every state/.style={minimum size=0pt,draw=blue!50,very thick,fill=blue!20},]
\node[state,initial] (r_1) {$R_1$};
\node[state] (r_2) [above=of r_1] {$R_2$};
\node[state, accepting] (r_3) [right=of r_1] {$R_3$};
\path[->] (r_1) edge node [above] {$b$} (r_3);
\path[->] (r_2) edge [bend left] node [above] {$a$} (r_3);
\path[->] (r_1) edge [bend left] node [left] {$a$} (r_2);
\path[->] (r_2) edge [bend left] node [right] {$a$} (r_1);
\path[->] (r_1) edge [loop below] node {$a$} ();
\path[->] (r_1) edge [bend right] node [below] {$a$} (r_3);
\end{tikzpicture}
\end{center}
\noindent where the single $\epsilon$-transition is replaced by
three additional $a$-transitions. Please do the calculations yourself
and verify that I did not forget any transition.
So in what follows, whenever we are given an $\epsilon$NFA we will
replace it by an equivalent NFA. The Scala code for this translation
is given in Figure~\ref{enfa}. The main workhorse in this code is a
function that calculates a fixpoint of function (Lines 5--10). This
function is used for ``discovering'' which states are reachable by
$\epsilon$-transitions. Once no new state is discovered, a fixpoint is
reached. This is used for example when calculating the starting states
of an equivalent NFA (see Line 36): we start with all starting states
of the $\epsilon$NFA and then look for all additional states that can
be reached by $\epsilon$-transitions. We keep on doing this until no
new state can be reached. This is what the $\epsilon$-closure, named
in the code \texttt{ecl}, calculates. Similarly, an accepting state of
the NFA is when we can reach an accepting state of the $\epsilon$NFA
by $\epsilon$-transitions.
\begin{figure}[p]
\small
\lstinputlisting[numbers=left]{../progs/display/enfa.scala}
\caption{A Scala function that translates $\epsilon$NFA into NFAs. The
transition function of $\epsilon$NFA takes as input an \texttt{Option[C]}.
\texttt{None} stands for an $\epsilon$-transition; \texttt{Some(c)}
for a ``proper'' transition consuming a character. The functions in
Lines 18--26 calculate
all states reachable by one or more $\epsilon$-transition for a given
set of states. The NFA is constructed in Lines 36--38.\label{enfa}}
\end{figure}
Also look carefully how the transitions of $\epsilon$NFAs are
implemented. The additional possibility of performing silent
transitions is encoded by using \texttt{Option[C]} as the type for the
``input''. The \texttt{Some}s stand for ``proper'' transitions where
a character is consumed; \texttt{None} stands for
$\epsilon$-transitions. The transition functions for the two
$\epsilon$NFAs from the beginning of this section can be defined as
{\small\begin{lstlisting}[language=Scala]
val enfa_trans1 : (State, Option[Char]) :=> Set[State] =
{ case (Q0, Some('a')) => Set(Q0)
case (Q0, None) => Set(Q1, Q2)
case (Q1, Some('a')) => Set(Q1)
case (Q2, Some('b')) => Set(Q2) }
val enfa_trans2 : (State, Option[Char]) :=> Set[State] =
{ case (R1, Some('b')) => Set(R3)
case (R1, None) => Set(R2)
case (R2, Some('a')) => Set(R1, R3) }
\end{lstlisting}}
\noindent
I hope you agree now with my earlier statement that the $\epsilon$NFAs
are just an API for NFAs.
\subsection*{Thompson Construction}
Having the translation of $\epsilon$NFAs to NFAs in place, we can
finally return to the problem of translating regular expressions into
equivalent NFAs. Recall that by equivalent we mean that the NFAs
recognise the same language. Consider the simple regular expressions
$\ZERO$, $\ONE$ and $c$. They can be translated into equivalent NFAs
as follows:
\begin{equation}\mbox{
\begin{tabular}[t]{l@{\hspace{10mm}}l}
\raisebox{1mm}{$\ZERO$} &
\begin{tikzpicture}[scale=0.7,>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
\node[state, initial] (Q_0) {$\mbox{}$};
\end{tikzpicture}\\\\
\raisebox{1mm}{$\ONE$} &
\begin{tikzpicture}[scale=0.7,>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
\node[state, initial, accepting] (Q_0) {$\mbox{}$};
\end{tikzpicture}\\\\
\raisebox{3mm}{$c$} &
\begin{tikzpicture}[scale=0.7,>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
\node[state, initial] (Q_0) {$\mbox{}$};
\node[state, accepting] (Q_1) [right=of Q_0] {$\mbox{}$};
\path[->] (Q_0) edge node [below] {$c$} (Q_1);
\end{tikzpicture}\\
\end{tabular}}\label{simplecases}
\end{equation}
\noindent
I let you think whether the NFAs can match exactly those strings the
regular expressions can match. To do this translation in code we need
a way to construct states programatically...and as an additional
constrain Scala needs to recognise that these states are being distinct.
For this I implemented in Figure~\ref{thompson1} a class
\texttt{TState} that includes a counter and a companion object that
increases this counter whenever a new state is created.\footnote{You might
have to read up what \emph{companion objects} do in Scala.}
\begin{figure}[p]
\small
\lstinputlisting[numbers=left]{../progs/display/thompson1.scala}
\caption{The first part of the Thompson Construction. Lines 7--16
implement a way of how to create new states that are all
distinct by virtue of a counter. This counter is
increased in the companion object of \texttt{TState}
whenever a new state is created. The code in Lines 24--40
constructs NFAs for the simple regular expressions $\ZERO$, $\ONE$ and $c$.
Compare the pictures given in \eqref{simplecases}.
\label{thompson1}}
\end{figure}
\begin{figure}[p]
\small
\lstinputlisting[numbers=left]{../progs/display/thompson2.scala}
\caption{The second part of the Thompson Construction implementing
the composition of NFAs according to $\cdot$, $+$ and ${}^*$.
The implicit class about rich partial functions
implements the infix operation \texttt{+++} which
combines an $\epsilon$NFA transition with a NFA transition
(both given as partial functions).\label{thompson2}}
\end{figure}
The case for the sequence regular expression $r_1 \cdot r_2$ is a bit more
complicated: Say, we are given by recursion two NFAs representing the regular
expressions $r_1$ and $r_2$ respectively.
\begin{center}
\begin{tikzpicture}[node distance=3mm,
>=stealth',very thick,
every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
\node[state, initial] (Q_0) {$\mbox{}$};
\node[state, initial] (Q_01) [below=1mm of Q_0] {$\mbox{}$};
\node[state, initial] (Q_02) [above=1mm of Q_0] {$\mbox{}$};
\node (R_1) [right=of Q_0] {$\ldots$};
\node[state, accepting] (T_1) [right=of R_1] {$\mbox{}$};
\node[state, accepting] (T_2) [above=of T_1] {$\mbox{}$};
\node[state, accepting] (T_3) [below=of T_1] {$\mbox{}$};
\node (A_0) [right=2.5cm of T_1] {$\mbox{}$};
\node[state, initial] (A_01) [above=1mm of A_0] {$\mbox{}$};
\node[state, initial] (A_02) [below=1mm of A_0] {$\mbox{}$};
\node (b_1) [right=of A_0] {$\ldots$};
\node[state, accepting] (c_1) [right=of b_1] {$\mbox{}$};
\node[state, accepting] (c_2) [above=of c_1] {$\mbox{}$};
\node[state, accepting] (c_3) [below=of c_1] {$\mbox{}$};
\begin{pgfonlayer}{background}
\node (1) [rounded corners, inner sep=1mm, thick,
draw=black!60, fill=black!20, fit= (Q_0) (R_1) (T_1) (T_2) (T_3)] {};
\node (2) [rounded corners, inner sep=1mm, thick,
draw=black!60, fill=black!20, fit= (A_0) (b_1) (c_1) (c_2) (c_3)] {};
\node [yshift=2mm] at (1.north) {$r_1$};
\node [yshift=2mm] at (2.north) {$r_2$};
\end{pgfonlayer}
\end{tikzpicture}
\end{center}
\noindent The first NFA has some accepting states and the second some
starting states. We obtain an $\epsilon$NFA for $r_1\cdot r_2$ by
connecting the accepting states of the first NFA with
$\epsilon$-transitions to the starting states of the second
automaton. By doing so we make the accepting states of the first NFA
to be non-accepting like so:
\begin{center}
\begin{tikzpicture}[node distance=3mm,
>=stealth',very thick,
every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
\node[state, initial] (Q_0) {$\mbox{}$};
\node[state, initial] (Q_01) [below=1mm of Q_0] {$\mbox{}$};
\node[state, initial] (Q_02) [above=1mm of Q_0] {$\mbox{}$};
\node (r_1) [right=of Q_0] {$\ldots$};
\node[state] (t_1) [right=of r_1] {$\mbox{}$};
\node[state] (t_2) [above=of t_1] {$\mbox{}$};
\node[state] (t_3) [below=of t_1] {$\mbox{}$};
\node (A_0) [right=2.5cm of t_1] {$\mbox{}$};
\node[state] (A_01) [above=1mm of A_0] {$\mbox{}$};
\node[state] (A_02) [below=1mm of A_0] {$\mbox{}$};
\node (b_1) [right=of A_0] {$\ldots$};
\node[state, accepting] (c_1) [right=of b_1] {$\mbox{}$};
\node[state, accepting] (c_2) [above=of c_1] {$\mbox{}$};
\node[state, accepting] (c_3) [below=of c_1] {$\mbox{}$};
\path[->] (t_1) edge (A_01);
\path[->] (t_2) edge node [above] {$\epsilon$} (A_01);
\path[->] (t_3) edge (A_01);
\path[->] (t_1) edge (A_02);
\path[->] (t_2) edge (A_02);
\path[->] (t_3) edge node [below] {$\epsilon$} (A_02);
\begin{pgfonlayer}{background}
\node (3) [rounded corners, inner sep=1mm, thick,
draw=black!60, fill=black!20, fit= (Q_0) (c_1) (c_2) (c_3)] {};
\node [yshift=2mm] at (3.north) {$r_1\cdot r_2$};
\end{pgfonlayer}
\end{tikzpicture}
\end{center}
\noindent The idea behind this construction is that the start of any
string is first recognised by the first NFA, then we silently change
to the second NFA; the ending of the string is recognised by the
second NFA...just like matching of a string by the regular expression
$r_1\cdot r_2$. The Scala code for this construction is given in
Figure~\ref{thompson2} in Lines 16--23. The starting states of the
$\epsilon$NFA are the starting states of the first NFA (corresponding
to $r_1$); the accepting function is the accepting function of the
second NFA (corresponding to $r_2$). The new transition function is
all the ``old'' transitions plus the $\epsilon$-transitions connecting
the accepting states of the first NFA to the starting states of the
first NFA (Lines 18 and 19). The $\epsilon$NFA is then immediately
translated in a NFA.
The case for the alternative regular expression $r_1 + r_2$ is
slightly different: We are given by recursion two NFAs representing
$r_1$ and $r_2$ respectively. Each NFA has some starting states and
some accepting states. We obtain a NFA for the regular expression $r_1
+ r_2$ by composing the transition functions (this crucially depends
on knowing that the states of each component NFA are distinct); and
also combine the starting states and accepting functions.
\begin{center}
\begin{tabular}[t]{ccc}
\begin{tikzpicture}[node distance=3mm,
>=stealth',very thick,
every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},
baseline=(current bounding box.center)]
\node at (0,0) (1) {$\mbox{}$};
\node (2) [above=10mm of 1] {};
\node[state, initial] (4) [above=1mm of 2] {$\mbox{}$};
\node[state, initial] (5) [below=1mm of 2] {$\mbox{}$};
\node[state, initial] (3) [below=10mm of 1] {$\mbox{}$};
\node (a) [right=of 2] {$\ldots\,$};
\node (a1) [right=of a] {$$};
\node[state, accepting] (a2) [above=of a1] {$\mbox{}$};
\node[state, accepting] (a3) [below=of a1] {$\mbox{}$};
\node (b) [right=of 3] {$\ldots$};
\node[state, accepting] (b1) [right=of b] {$\mbox{}$};
\node[state, accepting] (b2) [above=of b1] {$\mbox{}$};
\node[state, accepting] (b3) [below=of b1] {$\mbox{}$};
\begin{pgfonlayer}{background}
\node (1) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (2) (a1) (a2) (a3)] {};
\node (2) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (3) (b1) (b2) (b3)] {};
\node [yshift=3mm] at (1.north) {$r_1$};
\node [yshift=3mm] at (2.north) {$r_2$};
\end{pgfonlayer}
\end{tikzpicture}
&
\mbox{}\qquad\tikz{\draw[>=stealth,line width=2mm,->] (0,0) -- (1, 0)}\quad\mbox{}
&
\begin{tikzpicture}[node distance=3mm,
>=stealth',very thick,
every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},
baseline=(current bounding box.center)]
\node at (0,0) (1) {$\mbox{}$};
\node (2) [above=10mm of 1] {$$};
\node[state, initial] (4) [above=1mm of 2] {$\mbox{}$};
\node[state, initial] (5) [below=1mm of 2] {$\mbox{}$};
\node[state, initial] (3) [below=10mm of 1] {$\mbox{}$};
\node (a) [right=of 2] {$\ldots\,$};
\node (a1) [right=of a] {$$};
\node[state, accepting] (a2) [above=of a1] {$\mbox{}$};
\node[state, accepting] (a3) [below=of a1] {$\mbox{}$};
\node (b) [right=of 3] {$\ldots$};
\node[state, accepting] (b1) [right=of b] {$\mbox{}$};
\node[state, accepting] (b2) [above=of b1] {$\mbox{}$};
\node[state, accepting] (b3) [below=of b1] {$\mbox{}$};
%\path[->] (1) edge node [above] {$\epsilon$} (2);
%\path[->] (1) edge node [below] {$\epsilon$} (3);
\begin{pgfonlayer}{background}
\node (3) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (1) (a2) (a3) (b2) (b3)] {};
\node [yshift=3mm] at (3.north) {$r_1+ r_2$};
\end{pgfonlayer}
\end{tikzpicture}
\end{tabular}
\end{center}
\noindent The code for this construction is in Figure~\ref{thompson2}
in Lines 25--33.
Finally for the $*$-case we have a NFA for $r$ and connect its
accepting states to a new starting state via
$\epsilon$-transitions. This new starting state is also an accepting
state, because $r^*$ can recognise the empty string.
\begin{center}
\begin{tabular}[b]{@{\hspace{-4mm}}ccc@{}}
\begin{tikzpicture}[node distance=3mm,
>=stealth',very thick,
every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},
baseline=(current bounding box.north)]
\node at (0,0) (1) {$\mbox{}$};
\node[state, initial] (2) [right=16mm of 1] {$\mbox{}$};
\node (a) [right=of 2] {$\ldots$};
\node[state, accepting] (a1) [right=of a] {$\mbox{}$};
\node[state, accepting] (a2) [above=of a1] {$\mbox{}$};
\node[state, accepting] (a3) [below=of a1] {$\mbox{}$};
\begin{pgfonlayer}{background}
\node (1) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (2) (a1) (a2) (a3)] {};
\node [yshift=3mm] at (1.north) {$r$};
\end{pgfonlayer}
\end{tikzpicture}
&
\raisebox{-16mm}{\;\tikz{\draw[>=stealth,line width=2mm,->] (0,0) -- (1, 0)}}
&
\begin{tikzpicture}[node distance=3mm,
>=stealth',very thick,
every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},
baseline=(current bounding box.north)]
\node at (0,0) [state, initial,accepting] (1) {$\mbox{}$};
\node[state] (2) [right=16mm of 1] {$\mbox{}$};
\node (a) [right=of 2] {$\ldots$};
\node[state] (a1) [right=of a] {$\mbox{}$};
\node[state] (a2) [above=of a1] {$\mbox{}$};
\node[state] (a3) [below=of a1] {$\mbox{}$};
\path[->] (1) edge node [above] {$\epsilon$} (2);
\path[->] (a1) edge [bend left=45] node [above] {$\epsilon$} (1);
\path[->] (a2) edge [bend right] node [below] {$\epsilon$} (1);
\path[->] (a3) edge [bend left=45] node [below] {$\epsilon$} (1);
\begin{pgfonlayer}{background}
\node (2) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (1) (a2) (a3)] {};
\node [yshift=3mm] at (2.north) {$r^*$};
\end{pgfonlayer}
\end{tikzpicture}
\end{tabular}
\end{center}
\noindent
The corresponding code is in Figure~\ref{thompson2} in Lines 35--43)
To sum up, you can see in the sequence and star cases the need of
having silent $\epsilon$-transitions. Similarly the alternative case
shows the need of the NFA-nondeterminism. It seems awkward to form the
`alternative' composition of two DFAs, because DFA do not allow
several starting and successor states. All these constructions can now
be put together in the following recursive function:
{\small\begin{lstlisting}[language=Scala]
def thompson(r: Rexp) : NFAt = r match {
case ZERO => NFA_ZERO()
case ONE => NFA_ONE()
case CHAR(c) => NFA_CHAR(c)
case ALT(r1, r2) => NFA_ALT(thompson(r1), thompson(r2))
case SEQ(r1, r2) => NFA_SEQ(thompson(r1), thompson(r2))
case STAR(r1) => NFA_STAR(thompson(r1))
}
\end{lstlisting}}
\noindent
It calculates a NFA from a regular expressions. At last we can run
NFAs for the our evil regular expression examples. The graph on the
left shows that when translating a regular expression such as
$a^{\{n\}}$ into a NFA, the size can blow up and then even the
relative fast (on small examples) breadth-first search can be
slow. The graph on the right shows that with `evil' regular
expressions the depth-first search can be abysmally slow. Even if the
graphs not completely overlap with the curves of Python, Ruby and
Java, they are similar enough. OK\ldots now you know why regular
expression matchers in those languages are so slow.
\begin{center}
\begin{tabular}{@{\hspace{-1mm}}c@{\hspace{1mm}}c@{}}
\begin{tikzpicture}
\begin{axis}[
title={Graph: $a^{?\{n\}} \cdot a^{\{n\}}$ and strings
$\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
title style={yshift=-2ex},
xlabel={$n$},
x label style={at={(1.05,0.0)}},
ylabel={time in secs},
enlargelimits=false,
xtick={0,5,...,30},
xmax=33,
ymax=35,
ytick={0,5,...,30},
scaled ticks=false,
axis lines=left,
width=5.5cm,
height=4cm,
legend entries={Python,Ruby, breadth-first NFA},
legend style={at={(0.5,-0.25)},anchor=north,font=\small},
legend cell align=left]
\addplot[blue,mark=*, mark options={fill=white}] table {re-python.data};
\addplot[brown,mark=triangle*, mark options={fill=white}] table {re-ruby.data};
% breath-first search in NFAs
\addplot[red,mark=*, mark options={fill=white}] table {
1 0.00586
2 0.01209
3 0.03076
4 0.08269
5 0.12881
6 0.25146
7 0.51377
8 0.89079
9 1.62802
10 3.05326
11 5.92437
12 11.67863
13 24.00568
};
\end{axis}
\end{tikzpicture}
&
\begin{tikzpicture}
\begin{axis}[
title={Graph: $(a^*)^* \cdot b$ and strings
$\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
title style={yshift=-2ex},
xlabel={$n$},
x label style={at={(1.05,0.0)}},
ylabel={time in secs},
enlargelimits=false,
xtick={0,5,...,30},
xmax=33,
ymax=35,
ytick={0,5,...,30},
scaled ticks=false,
axis lines=left,
width=5.5cm,
height=4cm,
legend entries={Python, Java, depth-first NFA},
legend style={at={(0.5,-0.25)},anchor=north,font=\small},
legend cell align=left]
\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
% depth-first search in NFAs
\addplot[red,mark=*, mark options={fill=white}] table {
1 0.00605
2 0.03086
3 0.11994
4 0.45389
5 2.06192
6 8.04894
7 32.63549
};
\end{axis}
\end{tikzpicture}
\end{tabular}
\end{center}
\subsection*{Subset Construction}
Of course, some developers of regular expression matchers are aware
of these problems with sluggish NFAs and try to address them. One
common technique for this I like to show you in this section. It will
also explain why we insisted on polymorphic types in our DFA code
(remember I used \texttt{A} and \texttt{C} for the types of states and
the input, see Figure~\ref{dfa} on Page~\pageref{dfa}).\bigskip
\noindent
To start, remember that we did not bother with defining and
implementing $\epsilon$NFA; we immediately translated them into
equivalent NFAs. Equivalent in the sense of accepting the same
language (though we only claimed this and did not prove it
rigorously). Remember also that NFAs have non-deterministic
transitions defined as a relation or implemented as function returning
sets of states. This non-determinism is crucial for the Thompson
Construction to work (recall the cases for $\cdot$, $+$ and
${}^*$). But this non-determinism makes it harder with NFAs to decide
when a string is accepted or not; such a decision is rather
straightforward with DFAs: recall their transition function is a
\emph{function} that returns a single state. So we do not have to
search at all. What is perhaps interesting is the fact that for every
NFA we can find a DFA that also recognises the same language. This
might sound a bit paradoxical: NFA $\rightarrow$ decision of
acceptance hard; DFA $\rightarrow$ decision easy. But this \emph{is}
true\ldots but of course there is always a caveat---nothing is ever
for free in life.
There are a number of techniques for transforming a NFA into an
equivalent DFA, but the most famous one is the \emph{subset
construction}. Consider the following NFA where the states are
labelled with, say, $0$, $1$ and $2$.
\begin{center}
\begin{tabular}{c@{\hspace{10mm}}c}
\begin{tikzpicture}[scale=0.7,>=stealth',very thick,
every state/.style={minimum size=0pt,
draw=blue!50,very thick,fill=blue!20},
baseline=(current bounding box.center)]
\node[state,initial] (Q_0) {$0$};
\node[state] (Q_1) [below=of Q_0] {$1$};
\node[state, accepting] (Q_2) [below=of Q_1] {$2$};
\path[->] (Q_0) edge node [right] {$b$} (Q_1);
\path[->] (Q_1) edge node [right] {$a,b$} (Q_2);
\path[->] (Q_0) edge [loop above] node {$a, b$} ();
\end{tikzpicture}
&
\begin{tabular}{r|ll}
states & $a$ & $b$\\
\hline
$\{\}\phantom{\star}$ & $\{\}$ & $\{\}$\\
start: $\{0\}\phantom{\star}$ & $\{0\}$ & $\{0,1\}$\\
$\{1\}\phantom{\star}$ & $\{2\}$ & $\{2\}$\\
$\{2\}\star$ & $\{\}$ & $\{\}$\\
$\{0,1\}\phantom{\star}$ & $\{0,2\}$ & $\{0,1,2\}$\\
$\{0,2\}\star$ & $\{0\}$ & $\{0,1\}$\\
$\{1,2\}\star$ & $\{2\}$ & $\{2\}$\\
$\{0,1,2\}\star$ & $\{0,2\}$ & $\{0,1,2\}$\\
\end{tabular}
\end{tabular}
\end{center}
\noindent The states of the corresponding DFA are given by generating
all subsets of the set of states of the NFA (seen in the states column
in the table on the right). The other columns define the transition
function for the DFA for input $a$ and $b$. The first row states that
$\{\}$ is the sink state which has transitions for $a$ and $b$ to
itself. The next three lines are calculated as follows:
\begin{itemize}
\item Suppose you calculate the entry for the $a$-transition for state
$\{0\}$. Look for all states in the NFA that can be reached by such
a transition from this state; this is only state $0$; therefore from
state $\{0\}$ we can go to state $\{0\}$ via an $a$-transition.
\item Do the same for the $b$-transition; you can reach states $0$ and
$1$ in the NFA; therefore in the DFA we can go from state $\{0\}$ to
state $\{0,1\}$ via an $b$-transition.
\item Continue with the states $\{1\}$ and $\{2\}$.
\item Once you filled in the transitions for `simple' state, you only
have to build the union for the compound states $\{0,1\}$, $\{0,2\}$
and so on. For example for $\{0,1\}$ you take the union of line
$\{0\}$ and line $\{1\}$, which gives $\{0,2\}$ for $a$, and
$\{0,1,2\}$ for $b$. And so on.
\item The starting state of the DFA can be calculated from the
starting states of the NFA, that is in this case $0$. But in general
there can be many starting states in the NFA and you would take the
corresponding subset as \emph{the} starting state of the DFA.
\item The accepting states in the DFA are given by all sets that
contain a $2$, which is the only accpting state in this NFA. But
again in general if the subset contains an accepting state from the
NFA, then the corresponding state in the DFA is accepting as well.
\end{itemize}
\noindent This completes the subset construction. The corresponding
DFA for the NFA shown above is:
\begin{center}
\begin{tikzpicture}[scale=0.8,>=stealth',very thick,
every state/.style={minimum size=0pt,
draw=blue!50,very thick,fill=blue!20},
baseline=0mm]
\node[state,initial] (q0) {$0$};
\node[state] (q01) [right=of q0] {$0,1$};
\node[state,accepting] (q02) [below=of q01] {$0,2$};
\node[state,accepting] (q012) [right=of q02] {$0,1,2$};
\node[state] (q1) [below=0.5cm of q0] {$1$};
\node[state,accepting] (q2) [below=1cm of q1] {$2$};
\node[state] (qn) [below left=1cm of q2] {$\{\}$};
\node[state,accepting] (q12) [below right=1cm of q2] {$1,2$};
\path[->] (q0) edge node [above] {$b$} (q01);
\path[->] (q01) edge node [above] {$b$} (q012);
\path[->] (q0) edge [loop above] node {$a$} ();
\path[->] (q012) edge [loop right] node {$b$} ();
\path[->] (q012) edge node [below] {$a$} (q02);
\path[->] (q02) edge node [below] {$a$} (q0);
\path[->] (q01) edge [bend left] node [left] {$a$} (q02);
\path[->] (q02) edge [bend left] node [right] {$b$} (q01);
\path[->] (q1) edge node [left] {$a,b$} (q2);
\path[->] (q12) edge node [right] {$a, b$} (q2);
\path[->] (q2) edge node [right] {$a, b$} (qn);
\path[->] (qn) edge [loop left] node {$a,b$} ();
\end{tikzpicture}
\end{center}
\noindent
Please check that this is indeed a DFA. The big question is whether
this DFA can recognise the same language as the NFA we started with.
I let you ponder about this question.
There are also two points to note: One is that very often the
resulting DFA contains a number of ``dead'' states that are never
reachable from the starting state. This is obvious in this case, where
state $\{1\}$, $\{2\}$, $\{1,2\}$ and $\{\}$ can never be reached from
the starting state. In effect the DFA in this example is not a
\emph{minimal} DFA (more about this in a minute). Such dead states can
be safely removed without changing the language that is recognised by
the DFA. Another point is that in some cases, however, the subset
construction produces a DFA that does \emph{not} contain any dead
states\ldots{}and further calculates a minimal DFA. Which in turn
means that in some cases the number of states can by going from NFAs
to DFAs exponentially increase, namely by $2^n$ (which is the number
of subsets you can form for $n$ states). This blow up the number of
states in the DFA is again bad news for how quickly you can decide
whether a string is accepted by a DFA or not. So the caveat with DFAs
is that they might make the task of finding the next state trival, but
might require $2^n$ times as many states as a NFA.\bigskip
Lastly, can we
{\small\begin{lstlisting}[language=Scala]
def subset[A, C](nfa: NFA[A, C]) : DFA[Set[A], C] = {
DFA(nfa.starts,
{ case (qs, c) => nfa.nexts(qs, c) },
_.exists(nfa.fins))
}
\end{lstlisting}}
\subsection*{DFA Minimisation}
As seen in the subset construction, the translation
of a NFA to a DFA can result in a rather ``inefficient''
DFA. Meaning there are states that are not needed. A
DFA can be \emph{minimised} by the following algorithm:
\begin{enumerate}
\item Take all pairs $(q, p)$ with $q \not= p$
\item Mark all pairs that accepting and non-accepting states
\item For all unmarked pairs $(q, p)$ and all characters $c$
test whether
\begin{center}
$(\delta(q, c), \delta(p,c))$
\end{center}
are marked. If there is one, then also mark $(q, p)$.
\item Repeat last step until no change.
\item All unmarked pairs can be merged.
\end{enumerate}
\noindent To illustrate this algorithm, consider the following
DFA.
\begin{center}
\begin{tikzpicture}[>=stealth',very thick,auto,
every state/.style={minimum size=0pt,
inner sep=2pt,draw=blue!50,very thick,
fill=blue!20}]
\node[state,initial] (Q_0) {$Q_0$};
\node[state] (Q_1) [right=of Q_0] {$Q_1$};
\node[state] (Q_2) [below right=of Q_0] {$Q_2$};
\node[state] (Q_3) [right=of Q_2] {$Q_3$};
\node[state, accepting] (Q_4) [right=of Q_1] {$Q_4$};
\path[->] (Q_0) edge node [above] {$a$} (Q_1);
\path[->] (Q_1) edge node [above] {$a$} (Q_4);
\path[->] (Q_4) edge [loop right] node {$a, b$} ();
\path[->] (Q_3) edge node [right] {$a$} (Q_4);
\path[->] (Q_2) edge node [above] {$a$} (Q_3);
\path[->] (Q_1) edge node [right] {$b$} (Q_2);
\path[->] (Q_0) edge node [above] {$b$} (Q_2);
\path[->] (Q_2) edge [loop left] node {$b$} ();
\path[->] (Q_3) edge [bend left=95, looseness=1.3] node
[below] {$b$} (Q_0);
\end{tikzpicture}
\end{center}
\noindent In Step 1 and 2 we consider essentially a triangle
of the form
\begin{center}
\begin{tikzpicture}[scale=0.6,line width=0.8mm]
\draw (0,0) -- (4,0);
\draw (0,1) -- (4,1);
\draw (0,2) -- (3,2);
\draw (0,3) -- (2,3);
\draw (0,4) -- (1,4);
\draw (0,0) -- (0, 4);
\draw (1,0) -- (1, 4);
\draw (2,0) -- (2, 3);
\draw (3,0) -- (3, 2);
\draw (4,0) -- (4, 1);
\draw (0.5,-0.5) node {$Q_0$};
\draw (1.5,-0.5) node {$Q_1$};
\draw (2.5,-0.5) node {$Q_2$};
\draw (3.5,-0.5) node {$Q_3$};
\draw (-0.5, 3.5) node {$Q_1$};
\draw (-0.5, 2.5) node {$Q_2$};
\draw (-0.5, 1.5) node {$Q_3$};
\draw (-0.5, 0.5) node {$Q_4$};
\draw (0.5,0.5) node {\large$\star$};
\draw (1.5,0.5) node {\large$\star$};
\draw (2.5,0.5) node {\large$\star$};
\draw (3.5,0.5) node {\large$\star$};
\end{tikzpicture}
\end{center}
\noindent where the lower row is filled with stars, because in
the corresponding pairs there is always one state that is
accepting ($Q_4$) and a state that is non-accepting (the other
states).
Now in Step 3 we need to fill in more stars according whether
one of the next-state pairs are marked. We have to do this
for every unmarked field until there is no change anymore.
This gives the triangle
\begin{center}
\begin{tikzpicture}[scale=0.6,line width=0.8mm]
\draw (0,0) -- (4,0);
\draw (0,1) -- (4,1);
\draw (0,2) -- (3,2);
\draw (0,3) -- (2,3);
\draw (0,4) -- (1,4);
\draw (0,0) -- (0, 4);
\draw (1,0) -- (1, 4);
\draw (2,0) -- (2, 3);
\draw (3,0) -- (3, 2);
\draw (4,0) -- (4, 1);
\draw (0.5,-0.5) node {$Q_0$};
\draw (1.5,-0.5) node {$Q_1$};
\draw (2.5,-0.5) node {$Q_2$};
\draw (3.5,-0.5) node {$Q_3$};
\draw (-0.5, 3.5) node {$Q_1$};
\draw (-0.5, 2.5) node {$Q_2$};
\draw (-0.5, 1.5) node {$Q_3$};
\draw (-0.5, 0.5) node {$Q_4$};
\draw (0.5,0.5) node {\large$\star$};
\draw (1.5,0.5) node {\large$\star$};
\draw (2.5,0.5) node {\large$\star$};
\draw (3.5,0.5) node {\large$\star$};
\draw (0.5,1.5) node {\large$\star$};
\draw (2.5,1.5) node {\large$\star$};
\draw (0.5,3.5) node {\large$\star$};
\draw (1.5,2.5) node {\large$\star$};
\end{tikzpicture}
\end{center}
\noindent which means states $Q_0$ and $Q_2$, as well as $Q_1$
and $Q_3$ can be merged. This gives the following minimal DFA
\begin{center}
\begin{tikzpicture}[>=stealth',very thick,auto,
every state/.style={minimum size=0pt,
inner sep=2pt,draw=blue!50,very thick,
fill=blue!20}]
\node[state,initial] (Q_02) {$Q_{0, 2}$};
\node[state] (Q_13) [right=of Q_02] {$Q_{1, 3}$};
\node[state, accepting] (Q_4) [right=of Q_13]
{$Q_{4\phantom{,0}}$};
\path[->] (Q_02) edge [bend left] node [above] {$a$} (Q_13);
\path[->] (Q_13) edge [bend left] node [below] {$b$} (Q_02);
\path[->] (Q_02) edge [loop below] node {$b$} ();
\path[->] (Q_13) edge node [above] {$a$} (Q_4);
\path[->] (Q_4) edge [loop above] node {$a, b$} ();
\end{tikzpicture}
\end{center}
\subsection*{Brzozowski's Method}
As said before, we can also go into the other direction---from
DFAs to regular expressions. Brzozowski's method calculates
a regular expression using familiar transformations for
solving equational systems. Consider the DFA:
\begin{center}
\begin{tikzpicture}[scale=1.5,>=stealth',very thick,auto,
every state/.style={minimum size=0pt,
inner sep=2pt,draw=blue!50,very thick,
fill=blue!20}]
\node[state, initial] (q0) at ( 0,1) {$Q_0$};
\node[state] (q1) at ( 1,1) {$Q_1$};
\node[state, accepting] (q2) at ( 2,1) {$Q_2$};
\path[->] (q0) edge[bend left] node[above] {$a$} (q1)
(q1) edge[bend left] node[above] {$b$} (q0)
(q2) edge[bend left=50] node[below] {$b$} (q0)
(q1) edge node[above] {$a$} (q2)
(q2) edge [loop right] node {$a$} ()
(q0) edge [loop below] node {$b$} ();
\end{tikzpicture}
\end{center}
\noindent for which we can set up the following equational
system
\begin{eqnarray}
Q_0 & = & \ONE + Q_0\,b + Q_1\,b + Q_2\,b\\
Q_1 & = & Q_0\,a\\
Q_2 & = & Q_1\,a + Q_2\,a
\end{eqnarray}
\noindent There is an equation for each node in the DFA. Let
us have a look how the right-hand sides of the equations are
constructed. First have a look at the second equation: the
left-hand side is $Q_1$ and the right-hand side $Q_0\,a$. The
right-hand side is essentially all possible ways how to end up
in node $Q_1$. There is only one incoming edge from $Q_0$ consuming
an $a$. Therefore the right hand side is this
state followed by character---in this case $Q_0\,a$. Now lets
have a look at the third equation: there are two incoming
edges for $Q_2$. Therefore we have two terms, namely $Q_1\,a$ and
$Q_2\,a$. These terms are separated by $+$. The first states
that if in state $Q_1$ consuming an $a$ will bring you to
$Q_2$, and the second that being in $Q_2$ and consuming an $a$
will make you stay in $Q_2$. The right-hand side of the
first equation is constructed similarly: there are three
incoming edges, therefore there are three terms. There is
one exception in that we also ``add'' $\ONE$ to the
first equation, because it corresponds to the starting state
in the DFA.
Having constructed the equational system, the question is
how to solve it? Remarkably the rules are very similar to
solving usual linear equational systems. For example the
second equation does not contain the variable $Q_1$ on the
right-hand side of the equation. We can therefore eliminate
$Q_1$ from the system by just substituting this equation
into the other two. This gives
\begin{eqnarray}
Q_0 & = & \ONE + Q_0\,b + Q_0\,a\,b + Q_2\,b\\
Q_2 & = & Q_0\,a\,a + Q_2\,a
\end{eqnarray}
\noindent where in Equation (4) we have two occurrences
of $Q_0$. Like the laws about $+$ and $\cdot$, we can simplify
Equation (4) to obtain the following two equations:
\begin{eqnarray}
Q_0 & = & \ONE + Q_0\,(b + a\,b) + Q_2\,b\\
Q_2 & = & Q_0\,a\,a + Q_2\,a
\end{eqnarray}
\noindent Unfortunately we cannot make any more progress with
substituting equations, because both (6) and (7) contain the
variable on the left-hand side also on the right-hand side.
Here we need to now use a law that is different from the usual
laws about linear equations. It is called \emph{Arden's rule}.
It states that if an equation is of the form $q = q\,r + s$
then it can be transformed to $q = s\, r^*$. Since we can
assume $+$ is symmetric, Equation (7) is of that form: $s$ is
$Q_0\,a\,a$ and $r$ is $a$. That means we can transform
(7) to obtain the two new equations
\begin{eqnarray}
Q_0 & = & \ONE + Q_0\,(b + a\,b) + Q_2\,b\\
Q_2 & = & Q_0\,a\,a\,(a^*)
\end{eqnarray}
\noindent Now again we can substitute the second equation into
the first in order to eliminate the variable $Q_2$.
\begin{eqnarray}
Q_0 & = & \ONE + Q_0\,(b + a\,b) + Q_0\,a\,a\,(a^*)\,b
\end{eqnarray}
\noindent Pulling $Q_0$ out as a single factor gives:
\begin{eqnarray}
Q_0 & = & \ONE + Q_0\,(b + a\,b + a\,a\,(a^*)\,b)
\end{eqnarray}
\noindent This equation is again of the form so that we can
apply Arden's rule ($r$ is $b + a\,b + a\,a\,(a^*)\,b$ and $s$
is $\ONE$). This gives as solution for $Q_0$ the following
regular expression:
\begin{eqnarray}
Q_0 & = & \ONE\,(b + a\,b + a\,a\,(a^*)\,b)^*
\end{eqnarray}
\noindent Since this is a regular expression, we can simplify
away the $\ONE$ to obtain the slightly simpler regular
expression
\begin{eqnarray}
Q_0 & = & (b + a\,b + a\,a\,(a^*)\,b)^*
\end{eqnarray}
\noindent
Now we can unwind this process and obtain the solutions
for the other equations. This gives:
\begin{eqnarray}
Q_0 & = & (b + a\,b + a\,a\,(a^*)\,b)^*\\
Q_1 & = & (b + a\,b + a\,a\,(a^*)\,b)^*\,a\\
Q_2 & = & (b + a\,b + a\,a\,(a^*)\,b)^*\,a\,a\,(a)^*
\end{eqnarray}
\noindent Finally, we only need to ``add'' up the equations
which correspond to a terminal state. In our running example,
this is just $Q_2$. Consequently, a regular expression
that recognises the same language as the automaton is
\[
(b + a\,b + a\,a\,(a^*)\,b)^*\,a\,a\,(a)^*
\]
\noindent You can somewhat crosscheck your solution
by taking a string the regular expression can match and
and see whether it can be matched by the automaton.
One string for example is $aaa$ and \emph{voila} this
string is also matched by the automaton.
We should prove that Brzozowski's method really produces
an equivalent regular expression for the automaton. But
for the purposes of this module, we omit this.
\subsection*{Regular Languages}
Given the constructions in the previous sections we obtain
the following overall picture:
\begin{center}
\begin{tikzpicture}
\node (rexp) {\bf Regexps};
\node (nfa) [right=of rexp] {\bf NFAs};
\node (dfa) [right=of nfa] {\bf DFAs};
\node (mdfa) [right=of dfa] {\bf\begin{tabular}{c}minimal\\ DFAs\end{tabular}};
\path[->,line width=1mm] (rexp) edge node [above=4mm, black] {\begin{tabular}{c@{\hspace{9mm}}}Thompson's\\[-1mm] construction\end{tabular}} (nfa);
\path[->,line width=1mm] (nfa) edge node [above=4mm, black] {\begin{tabular}{c}subset\\[-1mm] construction\end{tabular}}(dfa);
\path[->,line width=1mm] (dfa) edge node [below=5mm, black] {minimisation} (mdfa);
\path[->,line width=1mm] (dfa) edge [bend left=45] node [below] {\begin{tabular}{l}Brzozowski's\\ method\end{tabular}} (rexp);
\end{tikzpicture}
\end{center}
\noindent By going from regular expressions over NFAs to DFAs,
we can always ensure that for every regular expression there
exists a NFA and a DFA that can recognise the same language.
Although we did not prove this fact. Similarly by going from
DFAs to regular expressions, we can make sure for every DFA
there exists a regular expression that can recognise the same
language. Again we did not prove this fact.
The interesting conclusion is that automata and regular
expressions can recognise the same set of languages:
\begin{quote} A language is \emph{regular} iff there exists a
regular expression that recognises all its strings.
\end{quote}
\noindent or equivalently
\begin{quote} A language is \emph{regular} iff there exists an
automaton that recognises all its strings.
\end{quote}
\noindent So for deciding whether a string is recognised by a
regular expression, we could use our algorithm based on
derivatives or NFAs or DFAs. But let us quickly look at what
the differences mean in computational terms. Translating a
regular expression into a NFA gives us an automaton that has
$O(n)$ states---that means the size of the NFA grows linearly
with the size of the regular expression. The problem with NFAs
is that the problem of deciding whether a string is accepted
or not is computationally not cheap. Remember with NFAs we
have potentially many next states even for the same input and
also have the silent $\epsilon$-transitions. If we want to
find a path from the starting state of a NFA to an accepting
state, we need to consider all possibilities. In Ruby and
Python this is done by a depth-first search, which in turn
means that if a ``wrong'' choice is made, the algorithm has to
backtrack and thus explore all potential candidates. This is
exactly the reason why Ruby and Python are so slow for evil
regular expressions. An alternative to the potentially slow
depth-first search is to explore the search space in a
breadth-first fashion, but this might incur a big memory
penalty.
To avoid the problems with NFAs, we can translate them
into DFAs. With DFAs the problem of deciding whether a
string is recognised or not is much simpler, because in
each state it is completely determined what the next
state will be for a given input. So no search is needed.
The problem with this is that the translation to DFAs
can explode exponentially the number of states. Therefore when
this route is taken, we definitely need to minimise the
resulting DFAs in order to have an acceptable memory
and runtime behaviour. But remember the subset construction
in the worst case explodes the number of states by $2^n$.
Effectively also the translation to DFAs can incur a big
runtime penalty.
But this does not mean that everything is bad with automata.
Recall the problem of finding a regular expressions for the
language that is \emph{not} recognised by a regular
expression. In our implementation we added explicitly such a
regular expressions because they are useful for recognising
comments. But in principle we did not need to. The argument
for this is as follows: take a regular expression, translate
it into a NFA and then a DFA that both recognise the same
language. Once you have the DFA it is very easy to construct
the automaton for the language not recognised by a DFA. If
the DFA is completed (this is important!), then you just need
to exchange the accepting and non-accepting states. You can
then translate this DFA back into a regular expression and
that will be the regular expression that can match all strings
the original regular expression could \emph{not} match.
It is also interesting that not all languages are regular. The
most well-known example of a language that is not regular
consists of all the strings of the form
\[a^n\,b^n\]
\noindent meaning strings that have the same number of $a$s
and $b$s. You can try, but you cannot find a regular
expression for this language and also not an automaton. One
can actually prove that there is no regular expression nor
automaton for this language, but again that would lead us too
far afield for what we want to do in this module.
%\section*{Further Reading}
%Compare what a ``human expert'' would create as an automaton for the
%regular expression $a\cdot (b + c)^*$ and what the Thomson
%algorithm generates.
%http://www.inf.ed.ac.uk/teaching/courses/ct/
\end{document}
%%% Local Variables:
%%% mode: latex
%%% TeX-master: t
%%% End: