\documentclass{article}
\usepackage{../style}
\usepackage{../langs}
\usepackage{../graphics}

\begin{document}
\fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017}

\section*{Handout 3 (Finite Automata)}

Every formal language and compiler course I know of bombards you first
with automata and then to a much, much smaller extent with regular
expressions. As you can see, this course is turned upside down:
regular expressions come first. The reason is that regular expressions
are easier to reason about and the notion of derivatives, although
already quite old, only became more widely known rather
recently. Still, let us in this lecture have a closer look at automata
and their relation to regular expressions. This will help us with
understanding why the regular expression matchers in Python, Ruby and
Java are so slow with certain regular expressions. On the way we will
also see what the limitations of regular expressions
are. Unfortunately, they cannot be used for \emph{everything}.

\subsection*{Deterministic Finite Automata}

Let's start\ldots the central definition is:\medskip

\noindent 
A \emph{deterministic finite automaton} (DFA), say $A$, is
given by a five-tuple written ${\cal A}(\varSigma, Qs, Q_0, F, \delta)$ where

\begin{itemize}
\item $\varSigma$ is an alphabet, 
\item $Qs$ is a finite set of states,
\item $Q_0 \in Qs$ is the start state,
\item $F \subseteq Qs$ are the accepting states, and
\item $\delta$ is the transition function.
\end{itemize}

\noindent I am sure you have seen this definition already
before. The transition function determines how to ``transition'' from
one state to the next state with respect to a character. We have the
assumption that these transition functions do not need to be defined
everywhere: so it can be the case that given a character there is no
next state, in which case we need to raise a kind of ``failure
exception''. That means actually we have \emph{partial} functions as
transitions---see the Scala implementation for DFAs later on. A
typical example of a DFA is

\begin{center}
\begin{tikzpicture}[>=stealth',very thick,auto,
    every state/.style={minimum size=0pt, inner sep=2pt,draw=blue!50,very thick, fill=blue!20},scale=2]
\node[state,initial] (Q_0) {$Q_0$};
\node[state] (Q_1) [right=of Q_0] {$Q_1$};
\node[state] (Q_2) [below right=of Q_0] {$Q_2$};
\node[state] (Q_3) [right=of Q_2] {$Q_3$};
\node[state, accepting] (Q_4) [right=of Q_1] {$Q_4$};
\path[->] (Q_0) edge node [above] {$a$} (Q_1);
\path[->] (Q_1) edge node [above] {$a$} (Q_4);
\path[->] (Q_4) edge [loop right] node {$a, b$} ();
\path[->] (Q_3) edge node [right] {$a$} (Q_4);
\path[->] (Q_2) edge node [above] {$a$} (Q_3);
\path[->] (Q_1) edge node [right] {$b$} (Q_2);
\path[->] (Q_0) edge node [above] {$b$} (Q_2);
\path[->] (Q_2) edge [loop left] node {$b$} ();
\path[->] (Q_3) edge [bend left=95, looseness=1.3] node [below] {$b$} (Q_0);
\end{tikzpicture}
\end{center}

\noindent In this graphical notation, the accepting state $Q_4$ is
indicated with double circles. Note that there can be more than one
accepting state. It is also possible that a DFA has no accepting
state at all, or that the starting state is also an accepting
state.
In the case above the transition function is defined everywhereand can also be given as a table as follows:\[\begin{array}{lcl}(Q_0, a) &\rightarrow& Q_1\\(Q_0, b) &\rightarrow& Q_2\\(Q_1, a) &\rightarrow& Q_4\\(Q_1, b) &\rightarrow& Q_2\\(Q_2, a) &\rightarrow& Q_3\\(Q_2, b) &\rightarrow& Q_2\\(Q_3, a) &\rightarrow& Q_4\\(Q_3, b) &\rightarrow& Q_0\\(Q_4, a) &\rightarrow& Q_4\\(Q_4, b) &\rightarrow& Q_4\\\end{array}\]\noindentPlease check that this table represents the same transition functionas the graph above.We need to define the notion of what language is accepted byan automaton. For this we lift the transition function$\delta$ from characters to strings as follows:\[\begin{array}{lcl}\widehat{\delta}(q, []) & \dn & q\\\widehat{\delta}(q, c\!::\!s) & \dn & \widehat{\delta}(\delta(q, c), s)\\\end{array}\]\noindent This lifted transition function is often called\emph{delta-hat}. Given a string, we start in the starting state andtake the first character of the string, follow to the next state, thentake the second character and so on. Once the string is exhausted andwe end up in an accepting state, then this string is accepted by theautomaton. Otherwise it is not accepted. This also means that if alongthe way we hit the case where the transition function $\delta$ is notdefined, we need to raise an error. In our implementation we will dealwith this case elegantly by using Scala's \texttt{Try}. Summing up: astring $s$ is in the \emph{language accepted by the automaton} ${\cal A}(\varSigma, Q, Q_0, F, \delta)$ iff\[\widehat{\delta}(Q_0, s) \in F \]\noindent I let you think about a definition that describes the set ofall strings accepted by a deterministic finite automaton.\begin{figure}[p]\small\lstinputlisting[numbers=left]{../progs/display/dfa.scala}\caption{An implementation of DFAs in Scala using partial functions. Note some subtleties: \texttt{deltas} implements the delta-hat construction by lifting the (partial) transition function to lists of characters. Since \texttt{delta} is given as a partial function, it can obviously go ``wrong'' in which case the \texttt{Try} in \texttt{accepts} catches the error and returns \texttt{false}---that means the string is not accepted. The example \texttt{delta} in Line 28--38 implements the DFA example shown earlier in the handout.\label{dfa}}\end{figure}My take on an implementation of DFAs in Scala is given inFigure~\ref{dfa}. As you can see, there are many features of themathematical definition that are quite closely reflected in thecode. In the DFA-class, there is a starting state, called\texttt{start}, with the polymorphic type \texttt{A}. There is apartial function \texttt{delta} for specifying the transitions---thesepartial functions take a state (of polymorphic type \texttt{A}) and aninput (of polymorphic type \texttt{C}) and produce a new state (oftype \texttt{A}). For the moment it is OK to assume that \texttt{A} issome arbitrary type for states and the input is just characters. (Thereason for not having concrete types, but polymorphic types for thestates and the input of DFAs will become clearer later on.)The DFA-class has also an argument for specifying final states. In theimplementation it is not a set of states, as in the mathematicaldefinition, but a function from states to booleans (this function issupposed to return true whenever a state is final; falseotherwise). While this boolean function is different from the sets ofstates, Scala allows us to use sets for such functions (see Line 40 wherethe DFA is initialised). 
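To see in a small example why a set can be used where a boolean
function is expected: in Scala a \texttt{Set[A]} is itself a function
of type \texttt{A => Boolean}. The snippet below is just a sketch with
made-up state numbers, not tied to the code in Figure~\ref{dfa}.

{\small\begin{lstlisting}[language=Scala]
// hypothetical example: a set of final states used as a test function
val is_final : Int => Boolean = Set(4)   // state 4 is the only final state
is_final(4)   // true
is_final(1)   // false
\end{lstlisting}}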
Again it will become clear later on why I usefunctions for final states, rather than sets.The most important point in the implementation is that I use Scala'spartial functions for representing the transitions; alternatives wouldhave been \texttt{Maps} or even \texttt{Lists}. One of the mainadvantages of using partial functions is that transitions can be quitenicely defined by a series of \texttt{case} statements (see Lines 28-- 38 for an example). If you need to represent an automaton with asink state (catch-all-state), you can use Scala's pattern matching andwrite something like{\small\begin{lstlisting}[language=Scala]abstract class State...case object Sink extends Stateval delta : (State, Char) :=> State = { case (S0, 'a') => S1 case (S1, 'a') => S2 case _ => Sink }\end{lstlisting}} \noindent I let you think what the corresponding DFA looks like in thegraph notation. Also, I suggest you to tinker with the Scala code inorder to define the DFA that does not accept any string at all. Finally, I let you ponder whether this is a good implementation ofDFAs or not. In doing so I hope you notice that the $\varSigma$ and$Qs$ components (the alphabet and the set of \emph{finite} states,respectively) are missing from the class definition. This means thatthe implementation allows you to do some ``fishy'' things you are notmeant to do with DFAs. Which fishy things could that be?\subsection*{Non-Deterministic Finite Automata}Remember we want to find out what the relation is between regularexpressions and automata. To do this with DFAs is a bit unwieldy.While with DFAs it is always clear that given a state and a characterwhat the next state is (potentially none), it will be convenient torelax this restriction. That means we allow states to have severalpotential successor states. We even allow more than one startingstate. The resulting construction is called a \emph{Non-Deterministic Finite Automaton} (NFA) given also as a five-tuple ${\cal A}(\varSigma, Qs, Q_{0s}, F, \rho)$ where\begin{itemize}\item $\varSigma$ is an alphabet, \item $Qs$ is a finite set of states\item $Q_{0s}$ is a set of start states ($Q_{0s} \subseteq Qs$)\item $F$ are some accepting states with $F \subseteq Qs$, and\item $\rho$ is a transition relation.\end{itemize}\noindentA typical example of a NFA is% A NFA for (ab* + b)*a\begin{center}\begin{tikzpicture}[>=stealth',very thick, auto, every state/.style={minimum size=0pt,inner sep=3pt, draw=blue!50,very thick,fill=blue!20},scale=2]\node[state,initial] (Q_0) {$Q_0$};\node[state] (Q_1) [right=of Q_0] {$Q_1$};\node[state, accepting] (Q_2) [right=of Q_1] {$Q_2$};\path[->] (Q_0) edge [loop above] node {$b$} ();\path[<-] (Q_0) edge node [below] {$b$} (Q_1);\path[->] (Q_0) edge [bend left] node [above] {$a$} (Q_1);\path[->] (Q_0) edge [bend right] node [below] {$a$} (Q_2);\path[->] (Q_1) edge [loop above] node {$a,b$} ();\path[->] (Q_1) edge node [above] {$a$} (Q_2);\end{tikzpicture}\end{center}\noindentThis NFA happens to have only one starting state, but in general therecould be more than one. Notice that in state $Q_0$ we might go tostate $Q_1$ \emph{or} to state $Q_2$ when receiving an $a$. Similarlyin state $Q_1$ and receiving an $a$, we can stay in state $Q_1$\emph{or} go to $Q_2$. This kind of choice is not allowed withDFAs. The downside of this choice in NFAs is that when it comes todeciding whether a string is accepted by a NFA we potentially have toexplore all possibilities. 
I let you think which strings the above NFAaccepts.There are a number of additional points you should note aboutNFAs. Every DFA is a NFA, but not vice versa. The $\rho$ in NFAs is atransition \emph{relation} (DFAs have a transition function). Thedifference between a function and a relation is that a function hasalways a single output, while a relation gives, roughly speaking,several outputs. Look again at the NFA above: if you are currently inthe state $Q_1$ and you read a character $b$, then you can transitionto either $Q_0$ \emph{or} $Q_2$. Which route, or output, you take isnot determined. This non-determinism can be represented by arelation.My implementation of NFAs in Scala is shown in Figure~\ref{nfa}.Perhaps interestingly, I do not actually use relations for my NFAs,but use transition functions that return sets of states. DFAs havepartial transition functions that return a single state; my NFAsreturn a set of states. I let you think about this representation forNFA-transitions and how it corresponds to the relations used in themathematical definition of NFAs. An example of a transition functionin Scala for the NFA shown above is{\small\begin{lstlisting}[language=Scala]val nfa_delta : (State, Char) :=> Set[State] = { case (Q0, 'a') => Set(Q1, Q2) case (Q0, 'b') => Set(Q0) case (Q1, 'a') => Set(Q1, Q2) case (Q1, 'b') => Set(Q0, Q1) }\end{lstlisting}} Like in the mathematical definition, \texttt{starts} is inNFAs a set of states; \texttt{fins} is again a function from states tobooleans. The \texttt{next} function calculates the set of next statesreachable from a single state \texttt{q} by a character~\texttt{c}. Incase there is no such state---the partial transition function isundefined---the empty set is returned (see function\texttt{applyOrElse} in Lines 9 and 10). The function \texttt{nexts}just lifts this function to sets of states.\begin{figure}[p]\small\lstinputlisting[numbers=left]{../progs/display/nfa.scala}\caption{A Scala implementation of NFAs using partial functions. Notice that the function \texttt{accepts} implements the acceptance of a string in a breadth-first search fashion. This can be a costly way of deciding whether a string is accepted or not in applications that need to handle large NFAs and large inputs.\label{nfa}}\end{figure}Look very careful at the \texttt{accepts} and \texttt{deltas}functions in NFAs and remember that when accepting a string by a NFAwe might have to explore all possible transitions (recall which stateto go to is not unique anymore with NFAs\ldots{}we need to explorepotentially all next states). The implementation achieves thisexploration through a \emph{breadth-first search}. This is fine forsmall NFAs, but can lead to real memory problems when the NFAs arebigger and larger strings need to be processed. As result, someregular expression matching engines resort to a \emph{depth-first search} with \emph{backtracking} in unsuccessful cases. In ourimplementation we can implement a depth-first version of\texttt{accepts} using Scala's \texttt{exists}-function as follows:{\small\begin{lstlisting}[language=Scala]def search(q: A, s: List[C]) : Boolean = s match { case Nil => fins(q) case c::cs => next(q, c).exists(search(_, cs)) }def accepts2(s: List[C]) : Boolean = starts.exists(search(_, s))\end{lstlisting}}\noindentThis depth-first way of exploration seems to work quite efficiently inmany examples and is much less of a strain on memory. 
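For a quick sanity check, the two acceptance functions can be compared
on the example NFA from above. The snippet below is only a sketch: it
assumes that the \texttt{NFA} case class from Figure~\ref{nfa} takes
the starting states, the transition function and the final-state
function in this order, and it re-uses the \texttt{nfa\_delta}
transition function defined earlier.

{\small\begin{lstlisting}[language=Scala]
// hypothetical usage; Q0, Q1, Q2 and nfa_delta are from the example above
val example_nfa = NFA(Set[State](Q0), nfa_delta, Set[State](Q2))
example_nfa.accepts("ba".toList)    // breadth-first exploration
example_nfa.accepts2("ba".toList)   // depth-first exploration, same answer
\end{lstlisting}}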
The problem isthat the backtracking can get ``catastrophic'' in someexamples---remember the catastrophic backtracking from earlierlectures. This depth-first search with backtracking is the reason forthe abysmal performance of some regular expression matchings in Java,Ruby and Python. I like to show you this in the next two sections.\subsection*{Epsilon NFAs}In order to get an idea what calculations are performed by Java \&friends, we need a method for transforming a regular expression intoan automaton. This automaton should accept exactly those strings thatare accepted by the regular expression. The simplest and mostwell-known method for this is called \emph{Thompson Construction},after the Turing Award winner Ken Thompson. This method is byrecursion over regular expressions and depends on the non-determinismin NFAs described in the previous section. You will see shortly whythis construction works well with NFAs, but is not so straightforwardwith DFAs.Unfortunately we are still one step away from our intended targetthough---because this construction uses a version of NFAs that allows``silent transitions''. The idea behind silent transitions is thatthey allow us to go from one state to the next without having toconsume a character. We label such silent transition with the letter$\epsilon$ and call the automata $\epsilon$NFAs. Two typical examplesof $\epsilon$NFAs are:\begin{center}\begin{tabular}[t]{c@{\hspace{9mm}}c}\begin{tikzpicture}[>=stealth',very thick, every state/.style={minimum size=0pt,draw=blue!50,very thick,fill=blue!20},]\node[state,initial] (Q_0) {$Q_0$};\node[state] (Q_1) [above=of Q_0] {$Q_1$};\node[state, accepting] (Q_2) [below=of Q_0] {$Q_2$};\path[->] (Q_0) edge node [left] {$\epsilon$} (Q_1);\path[->] (Q_0) edge node [left] {$\epsilon$} (Q_2);\path[->] (Q_0) edge [loop right] node {$a$} ();\path[->] (Q_1) edge [loop right] node {$a$} ();\path[->] (Q_2) edge [loop right] node {$b$} ();\end{tikzpicture} &\raisebox{20mm}{\begin{tikzpicture}[>=stealth',very thick, every state/.style={minimum size=0pt,draw=blue!50,very thick,fill=blue!20},]\node[state,initial] (r_1) {$R_1$};\node[state] (r_2) [above=of r_1] {$R_2$};\node[state, accepting] (r_3) [right=of r_1] {$R_3$};\path[->] (r_1) edge node [below] {$b$} (r_3);\path[->] (r_2) edge [bend left] node [above] {$a$} (r_3);\path[->] (r_1) edge [bend left] node [left] {$\epsilon$} (r_2);\path[->] (r_2) edge [bend left] node [right] {$a$} (r_1);\end{tikzpicture}}\end{tabular}\end{center}\noindentConsider the $\epsilon$NFA on the left-hand side: the$\epsilon$-transitions mean you do not have to ``consume'' any part ofthe input string, but ``silently'' change to a different state. Inthis example, if you are in the starting state $Q_0$, you can silentlymove either to $Q_1$ or $Q_2$. You can see that once you are in $Q_1$,respectively $Q_2$, you cannot ``go back'' to the other states. So itseems allowing $\epsilon$-transitions is a rather substantialextension to NFAs. On first appearances, $\epsilon$-transitions mighteven look rather strange, or even silly. To start with, silenttransitions make the decision whether a string is accepted by anautomaton even harder: with $\epsilon$NFAs we have to look whether wecan do first some $\epsilon$-transitions and then do a``proper''-transition; and after any ``proper''-transition we againhave to check whether we can do again some silent transitions. 
Evenworse, if there is a silent transition pointing back to the samestate, then we have to be careful our decision procedure for stringsdoes not loop (remember the depth-first search for exploring allstates).The obvious question is: Do we get anything in return for this hasslewith silent transitions? Well, we still have to work for it\ldotsunfortunately. If we were to follow the many textbooks on thesubject, we would now start with defining what $\epsilon$NFAsare---that would require extending the transition relation ofNFAs. Next, we would show that the $\epsilon$NFAs are equivalent toNFAs and so on. Once we have done all this on paper, we would need toimplement $\epsilon$NFAs. Lets try to take a shortcut instead. We arenot really interested in $\epsilon$NFAs; they are only a convenienttool for translating regular expressions into automata. So we are notgoing to implementing them explicitly, but translate them immediatelyinto NFAs (in a sense $\epsilon$NFAs are just a convenient API forlazy people ;o). How does the translation work? Well we have to findall transitions of the form\[q\stackrel{\epsilon}{\longrightarrow}\ldots\stackrel{\epsilon}{\longrightarrow}\;\stackrel{a}{\longrightarrow}\;\stackrel{\epsilon}{\longrightarrow}\ldots\stackrel{\epsilon}{\longrightarrow} q'\]\noindent where somewhere in the ``middle'' is an $a$-transition. Wereplace them with $q \stackrel{a}{\longrightarrow} q'$. Doing this tothe $\epsilon$NFA on the right-hand side above gives the NFA\begin{center}\begin{tikzpicture}[>=stealth',very thick, every state/.style={minimum size=0pt,draw=blue!50,very thick,fill=blue!20},]\node[state,initial] (r_1) {$R_1$};\node[state] (r_2) [above=of r_1] {$R_2$};\node[state, accepting] (r_3) [right=of r_1] {$R_3$};\path[->] (r_1) edge node [above] {$b$} (r_3);\path[->] (r_2) edge [bend left] node [above] {$a$} (r_3);\path[->] (r_1) edge [bend left] node [left] {$a$} (r_2);\path[->] (r_2) edge [bend left] node [right] {$a$} (r_1);\path[->] (r_1) edge [loop below] node {$a$} ();\path[->] (r_1) edge [bend right] node [below] {$a$} (r_3);\end{tikzpicture}\end{center}\noindent where the single $\epsilon$-transition is replaced bythree additional $a$-transitions. Please do the calculations yourselfand verify that I did not forget any transition.So in what follows, whenever we are given an $\epsilon$NFA we willreplace it by an equivalent NFA. The Scala code for this translationis given in Figure~\ref{enfa}. The main workhorse in this code is afunction that calculates a fixpoint of function (Lines 5--10). Thisfunction is used for ``discovering'' which states are reachable by$\epsilon$-transitions. Once no new state is discovered, a fixpoint isreached. This is used for example when calculating the starting statesof an equivalent NFA (see Line 36): we start with all starting statesof the $\epsilon$NFA and then look for all additional states that canbe reached by $\epsilon$-transitions. We keep on doing this until nonew state can be reached. This is what the $\epsilon$-closure, namedin the code \texttt{ecl}, calculates. Similarly, an accepting state ofthe NFA is when we can reach an accepting state of the $\epsilon$NFAby $\epsilon$-transitions.\begin{figure}[p]\small\lstinputlisting[numbers=left]{../progs/display/enfa.scala}\caption{A Scala function that translates $\epsilon$NFA into NFAs. The transition function of $\epsilon$NFA takes as input an \texttt{Option[C]}. \texttt{None} stands for an $\epsilon$-transition; \texttt{Some(c)} for a ``proper'' transition consuming a character. 
The functions in Lines 18--26 calculate all states reachable by one or more $\epsilon$-transition for a given set of states. The NFA is constructed in Lines 36--38. Note the interesting commands in Lines 5 and 6: their purpose is to ensure that \texttt{fixpT} is the tail-recursive version of the fixpoint construction; otherwise we would quickly get a stack-overflow exception, even on small examples, due to limitations of the JVM. \label{enfa}}\end{figure}Also look carefully how the transitions of $\epsilon$NFAs areimplemented. The additional possibility of performing silenttransitions is encoded by using \texttt{Option[C]} as the type for the``input''. The \texttt{Some}s stand for ``proper'' transitions wherea character is consumed; \texttt{None} stands for$\epsilon$-transitions. The transition functions for the two$\epsilon$NFAs from the beginning of this section can be defined as{\small\begin{lstlisting}[language=Scala]val enfa_trans1 : (State, Option[Char]) :=> Set[State] = { case (Q0, Some('a')) => Set(Q0) case (Q0, None) => Set(Q1, Q2) case (Q1, Some('a')) => Set(Q1) case (Q2, Some('b')) => Set(Q2) }val enfa_trans2 : (State, Option[Char]) :=> Set[State] = { case (R1, Some('b')) => Set(R3) case (R1, None) => Set(R2) case (R2, Some('a')) => Set(R1, R3) }\end{lstlisting}}\noindentI hope you agree now with my earlier statement that the $\epsilon$NFAsare just an API for NFAs.\subsection*{Thompson Construction}Having the translation of $\epsilon$NFAs to NFAs in place, we canfinally return to the problem of translating regular expressions intoequivalent NFAs. Recall that by equivalent we mean that the NFAsrecognise the same language. Consider the simple regular expressions$\ZERO$, $\ONE$ and $c$. They can be translated into equivalent NFAsas follows:\begin{equation}\mbox{\begin{tabular}[t]{l@{\hspace{10mm}}l}\raisebox{1mm}{$\ZERO$} & \begin{tikzpicture}[scale=0.7,>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]\node[state, initial] (Q_0) {$\mbox{}$};\end{tikzpicture}\\\\\raisebox{1mm}{$\ONE$} & \begin{tikzpicture}[scale=0.7,>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]\node[state, initial, accepting] (Q_0) {$\mbox{}$};\end{tikzpicture}\\\\\raisebox{3mm}{$c$} & \begin{tikzpicture}[scale=0.7,>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]\node[state, initial] (Q_0) {$\mbox{}$};\node[state, accepting] (Q_1) [right=of Q_0] {$\mbox{}$};\path[->] (Q_0) edge node [below] {$c$} (Q_1);\end{tikzpicture}\\\end{tabular}}\label{simplecases}\end{equation}\noindentI let you think whether the NFAs can match exactly those strings theregular expressions can match. To do this translation in code we needa way to construct states ``programatically''...and as an additionalconstraint Scala needs to recognise that these states are being distinct.For this I implemented in Figure~\ref{thompson1} a class\texttt{TState} that includes a counter and a companion object thatincreases this counter whenever a new state is created.\footnote{You might have to read up what \emph{companion objects} do in Scala.}\begin{figure}[p]\small\lstinputlisting[numbers=left]{../progs/display/thompson1.scala}\caption{The first part of the Thompson Construction. Lines 7--16 implement a way of how to create new states that are all distinct by virtue of a counter. This counter is increased in the companion object of \texttt{TState} whenever a new state is created. 
The code in Lines 24--40 constructs NFAs for the simple regular expressions $\ZERO$, $\ONE$ and $c$. Compare this code with the pictures given in \eqref{simplecases} on Page~\pageref{simplecases}. \label{thompson1}}\end{figure}\begin{figure}[p]\small\lstinputlisting[numbers=left]{../progs/display/thompson2.scala}\caption{The second part of the Thompson Construction implementing the composition of NFAs according to $\cdot$, $+$ and ${}^*$. The implicit class about rich partial functions implements the infix operation \texttt{+++} which combines an $\epsilon$NFA transition with a NFA transition (both are given as partial functions---but with different type!).\label{thompson2}}\end{figure}The case for the sequence regular expression $r_1 \cdot r_2$ is a bit morecomplicated: Say, we are given by recursion two NFAs representing the regularexpressions $r_1$ and $r_2$ respectively.\begin{center}\begin{tikzpicture}[node distance=3mm, >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]\node[state, initial] (Q_0) {$\mbox{}$};\node[state, initial] (Q_01) [below=1mm of Q_0] {$\mbox{}$}; \node[state, initial] (Q_02) [above=1mm of Q_0] {$\mbox{}$}; \node (R_1) [right=of Q_0] {$\ldots$};\node[state, accepting] (T_1) [right=of R_1] {$\mbox{}$};\node[state, accepting] (T_2) [above=of T_1] {$\mbox{}$};\node[state, accepting] (T_3) [below=of T_1] {$\mbox{}$};\node (A_0) [right=2.5cm of T_1] {$\mbox{}$};\node[state, initial] (A_01) [above=1mm of A_0] {$\mbox{}$};\node[state, initial] (A_02) [below=1mm of A_0] {$\mbox{}$};\node (b_1) [right=of A_0] {$\ldots$};\node[state, accepting] (c_1) [right=of b_1] {$\mbox{}$};\node[state, accepting] (c_2) [above=of c_1] {$\mbox{}$};\node[state, accepting] (c_3) [below=of c_1] {$\mbox{}$};\begin{pgfonlayer}{background} \node (1) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (Q_0) (R_1) (T_1) (T_2) (T_3)] {}; \node (2) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (A_0) (b_1) (c_1) (c_2) (c_3)] {};\node [yshift=2mm] at (1.north) {$r_1$};\node [yshift=2mm] at (2.north) {$r_2$};\end{pgfonlayer}\end{tikzpicture}\end{center}\noindent The first NFA has some accepting states and the second somestarting states. We obtain an $\epsilon$NFA for $r_1\cdot r_2$ byconnecting the accepting states of the first NFA with$\epsilon$-transitions to the starting states of the secondautomaton. 
By doing so we make the accepting states of the first NFA
non-accepting, like so:

\begin{center}
\begin{tikzpicture}[node distance=3mm,
    >=stealth',very thick,
    every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
\node[state, initial] (Q_0) {$\mbox{}$};
\node[state, initial] (Q_01) [below=1mm of Q_0] {$\mbox{}$};
\node[state, initial] (Q_02) [above=1mm of Q_0] {$\mbox{}$};
\node (r_1) [right=of Q_0] {$\ldots$};
\node[state] (t_1) [right=of r_1] {$\mbox{}$};
\node[state] (t_2) [above=of t_1] {$\mbox{}$};
\node[state] (t_3) [below=of t_1] {$\mbox{}$};
\node (A_0) [right=2.5cm of t_1] {$\mbox{}$};
\node[state] (A_01) [above=1mm of A_0] {$\mbox{}$};
\node[state] (A_02) [below=1mm of A_0] {$\mbox{}$};
\node (b_1) [right=of A_0] {$\ldots$};
\node[state, accepting] (c_1) [right=of b_1] {$\mbox{}$};
\node[state, accepting] (c_2) [above=of c_1] {$\mbox{}$};
\node[state, accepting] (c_3) [below=of c_1] {$\mbox{}$};
\path[->] (t_1) edge (A_01);
\path[->] (t_2) edge node [above] {$\epsilon$s} (A_01);
\path[->] (t_3) edge (A_01);
\path[->] (t_1) edge (A_02);
\path[->] (t_2) edge (A_02);
\path[->] (t_3) edge node [below] {$\epsilon$s} (A_02);
\begin{pgfonlayer}{background}
  \node (3) [rounded corners, inner sep=1mm, thick,
    draw=black!60, fill=black!20, fit= (Q_0) (c_1) (c_2) (c_3)] {};
\node [yshift=2mm] at (3.north) {$r_1\cdot r_2$};
\end{pgfonlayer}
\end{tikzpicture}
\end{center}

\noindent The idea behind this construction is that the start of any
string is first recognised by the first NFA, then we silently change
to the second NFA; the ending of the string is recognised by the
second NFA...just like matching of a string by the regular expression
$r_1\cdot r_2$. The Scala code for this construction is given in
Figure~\ref{thompson2} in Lines 16--23. The starting states of the
$\epsilon$NFA are the starting states of the first NFA (corresponding
to $r_1$); the accepting function is the accepting function of the
second NFA (corresponding to $r_2$). The new transition function is
all the ``old'' transitions plus the $\epsilon$-transitions connecting
the accepting states of the first NFA to the starting states of the
second NFA (Lines 18 and 19). The $\epsilon$NFA is then immediately
translated into a NFA.

The case for the alternative regular expression $r_1 + r_2$ is
slightly different: We are given by recursion two NFAs representing
$r_1$ and $r_2$ respectively. Each NFA has some starting states and
some accepting states. We obtain a NFA for the regular expression
$r_1 + r_2$ by composing the transition functions (this crucially
depends on knowing that the states of each component NFA are
distinct---recall that we implemented some bespoke code for states in
order to guarantee this).
We alsoneed to combine the starting states and accepting functionsappropriately.\begin{center}\begin{tabular}[t]{ccc}\begin{tikzpicture}[node distance=3mm, >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20}, baseline=(current bounding box.center)]\node at (0,0) (1) {$\mbox{}$};\node (2) [above=10mm of 1] {};\node[state, initial] (4) [above=1mm of 2] {$\mbox{}$};\node[state, initial] (5) [below=1mm of 2] {$\mbox{}$};\node[state, initial] (3) [below=10mm of 1] {$\mbox{}$};\node (a) [right=of 2] {$\ldots\,$};\node (a1) [right=of a] {$$};\node[state, accepting] (a2) [above=of a1] {$\mbox{}$};\node[state, accepting] (a3) [below=of a1] {$\mbox{}$};\node (b) [right=of 3] {$\ldots$};\node[state, accepting] (b1) [right=of b] {$\mbox{}$};\node[state, accepting] (b2) [above=of b1] {$\mbox{}$};\node[state, accepting] (b3) [below=of b1] {$\mbox{}$};\begin{pgfonlayer}{background}\node (1) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (2) (a1) (a2) (a3)] {};\node (2) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (3) (b1) (b2) (b3)] {};\node [yshift=3mm] at (1.north) {$r_1$};\node [yshift=3mm] at (2.north) {$r_2$};\end{pgfonlayer}\end{tikzpicture}&\mbox{}\qquad\tikz{\draw[>=stealth,line width=2mm,->] (0,0) -- (1, 0)}\quad\mbox{}&\begin{tikzpicture}[node distance=3mm, >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20}, baseline=(current bounding box.center)]\node at (0,0) (1) {$\mbox{}$};\node (2) [above=10mm of 1] {$$};\node[state, initial] (4) [above=1mm of 2] {$\mbox{}$};\node[state, initial] (5) [below=1mm of 2] {$\mbox{}$};\node[state, initial] (3) [below=10mm of 1] {$\mbox{}$};\node (a) [right=of 2] {$\ldots\,$};\node (a1) [right=of a] {$$};\node[state, accepting] (a2) [above=of a1] {$\mbox{}$};\node[state, accepting] (a3) [below=of a1] {$\mbox{}$};\node (b) [right=of 3] {$\ldots$};\node[state, accepting] (b1) [right=of b] {$\mbox{}$};\node[state, accepting] (b2) [above=of b1] {$\mbox{}$};\node[state, accepting] (b3) [below=of b1] {$\mbox{}$};%\path[->] (1) edge node [above] {$\epsilon$} (2);%\path[->] (1) edge node [below] {$\epsilon$} (3);\begin{pgfonlayer}{background}\node (3) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (1) (a2) (a3) (b2) (b3)] {};\node [yshift=3mm] at (3.north) {$r_1+ r_2$};\end{pgfonlayer}\end{tikzpicture}\end{tabular}\end{center}\noindent The code for this construction is in Figure~\ref{thompson2}in Lines 25--33.Finally for the $*$-case we have a NFA for $r$ and connect itsaccepting states to a new starting state via$\epsilon$-transitions. 
This new starting state is also an acceptingstate, because $r^*$ can recognise the empty string.\begin{center}\begin{tabular}[b]{@{}ccc@{}} \begin{tikzpicture}[node distance=3mm, >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20}, baseline=(current bounding box.north)]\node (2) {$\mbox{}$};\node[state, initial] (4) [above=1mm of 2] {$\mbox{}$};\node[state, initial] (5) [below=1mm of 2] {$\mbox{}$}; \node (a) [right=of 2] {$\ldots$};\node[state, accepting] (a1) [right=of a] {$\mbox{}$};\node[state, accepting] (a2) [above=of a1] {$\mbox{}$};\node[state, accepting] (a3) [below=of a1] {$\mbox{}$};\begin{pgfonlayer}{background}\node (1) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (2) (a1) (a2) (a3)] {};\node [yshift=3mm] at (1.north) {$r$};\end{pgfonlayer}\end{tikzpicture}&\raisebox{-16mm}{\;\tikz{\draw[>=stealth,line width=2mm,->] (0,0) -- (1, 0)}}&\begin{tikzpicture}[node distance=3mm, >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20}, baseline=(current bounding box.north)]\node at (0,0) [state, initial,accepting] (1) {$\mbox{}$};\node (2) [right=16mm of 1] {$\mbox{}$};\node[state] (4) [above=1mm of 2] {$\mbox{}$};\node[state] (5) [below=1mm of 2] {$\mbox{}$}; \node (a) [right=of 2] {$\ldots$};\node[state] (a1) [right=of a] {$\mbox{}$};\node[state] (a2) [above=of a1] {$\mbox{}$};\node[state] (a3) [below=of a1] {$\mbox{}$};\path[->] (1) edge node [below] {$\epsilon$} (4);\path[->] (1) edge node [below] {$\epsilon$} (5);\path[->] (a1) edge [bend left=45] node [below] {$\epsilon$} (1);\path[->] (a2) edge [bend right] node [below] {$\epsilon$} (1);\path[->] (a3) edge [bend left=45] node [below] {$\epsilon$} (1);\begin{pgfonlayer}{background}\node (2) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (1) (a2) (a3)] {};\node [yshift=3mm] at (2.north) {$r^*$};\end{pgfonlayer}\end{tikzpicture} \end{tabular}\end{center}\noindent The corresponding code is in Figure~\ref{thompson2} in Lines 35--43)To sum up, you can see in the sequence and star cases the need ofhaving silent $\epsilon$-transitions. Similarly the alternative caseshows the need of the NFA-nondeterminism. It seems awkward to form the`alternative' composition of two DFAs, because DFA do not allowseveral starting and successor states. All these constructions can nowbe put together in the following recursive function:{\small\begin{lstlisting}[language=Scala]def thompson(r: Rexp) : NFAt = r match { case ZERO => NFA_ZERO() case ONE => NFA_ONE() case CHAR(c) => NFA_CHAR(c) case ALT(r1, r2) => NFA_ALT(thompson(r1), thompson(r2)) case SEQ(r1, r2) => NFA_SEQ(thompson(r1), thompson(r2)) case STAR(r1) => NFA_STAR(thompson(r1))}\end{lstlisting}}\noindentIt calculates a NFA from a regular expressions. At last we can runNFAs for the our evil regular expression examples. The graph on theleft shows that when translating a regular expression such as$a^{\{n\}}$ into a NFA, the size can blow up and then even therelative fast (on small examples) breadth-first search can beslow. The graph on the right shows that with `evil' regularexpressions the depth-first search can be abysmally slow. Even if thegraphs not completely overlap with the curves of Python, Ruby andJava, they are similar enough. OK\ldots now you know why regularexpression matchers in those languages are so slow. 
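A small driver along the following lines can be used to obtain
measurements like the ones shown in the graphs below. It is only a
sketch: it assumes the regular expression constructors \texttt{SEQ},
\texttt{STAR} and \texttt{CHAR} from the earlier handouts, the
depth-first \texttt{accepts2} function from the previous section, and
a hypothetical timing helper \texttt{time\_needed}.

{\small\begin{lstlisting}[language=Scala]
// hypothetical driver for the evil regular expression (a*)* b
val EVIL = SEQ(STAR(STAR(CHAR('a'))), CHAR('b'))

def time_needed[T](code: => T) : Double = {
  val start = System.nanoTime()
  code
  (System.nanoTime() - start) / 1.0e9
}

for (n <- 1 to 7) {
  val s = ("a" * n).toList
  println(s"$n  ${time_needed(thompson(EVIL).accepts2(s))} secs")
}
\end{lstlisting}}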
\begin{center}\begin{tabular}{@{\hspace{-1mm}}c@{\hspace{1mm}}c@{}} \begin{tikzpicture}\begin{axis}[ title={Graph: $a^{?\{n\}} \cdot a^{\{n\}}$ and strings $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$}, title style={yshift=-2ex}, xlabel={$n$}, x label style={at={(1.05,0.0)}}, ylabel={time in secs}, enlargelimits=false, xtick={0,5,...,30}, xmax=33, ymax=35, ytick={0,5,...,30}, scaled ticks=false, axis lines=left, width=5.5cm, height=4cm, legend entries={Python,Ruby, breadth-first NFA}, legend style={at={(0.5,-0.25)},anchor=north,font=\small}, legend cell align=left] \addplot[blue,mark=*, mark options={fill=white}] table {re-python.data}; \addplot[brown,mark=triangle*, mark options={fill=white}] table {re-ruby.data}; % breath-first search in NFAs \addplot[red,mark=*, mark options={fill=white}] table { 1 0.00586 2 0.01209 3 0.03076 4 0.08269 5 0.12881 6 0.25146 7 0.51377 8 0.89079 9 1.62802 10 3.05326 11 5.92437 12 11.67863 13 24.00568 };\end{axis}\end{tikzpicture}&\begin{tikzpicture}\begin{axis}[ title={Graph: $(a^*)^* \cdot b$ and strings $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$}, title style={yshift=-2ex}, xlabel={$n$}, x label style={at={(1.05,0.0)}}, ylabel={time in secs}, enlargelimits=false, xtick={0,5,...,30}, xmax=33, ymax=35, ytick={0,5,...,30}, scaled ticks=false, axis lines=left, width=5.5cm, height=4cm, legend entries={Python, Java, depth-first NFA}, legend style={at={(0.5,-0.25)},anchor=north,font=\small}, legend cell align=left] \addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data}; \addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data}; % depth-first search in NFAs \addplot[red,mark=*, mark options={fill=white}] table { 1 0.00605 2 0.03086 3 0.11994 4 0.45389 5 2.06192 6 8.04894 7 32.63549 };\end{axis}\end{tikzpicture}\end{tabular}\end{center}\subsection*{Subset Construction}Of course, some developers of regular expression matchers are aware ofthese problems with sluggish NFAs and try to address them. One commontechnique for alleviating the problem I like to show you in thissection. This will also explain why we insisted on polymorphic types inour DFA code (remember I used \texttt{A} and \texttt{C} for the typesof states and the input, see Figure~\ref{dfa} onPage~\pageref{dfa}).\bigskip\noindentTo start remember that we did not bother with defining andimplementing $\epsilon$NFAs: we immediately translated them intoequivalent NFAs. Equivalent in the sense of accepting the samelanguage (though we only claimed this and did not prove itrigorously). Remember also that NFAs have non-deterministictransitions defined as a relation or implemented as function returningsets of states. This non-determinism is crucial for the ThompsonConstruction to work (recall the cases for $\cdot$, $+$ and${}^*$). But this non-determinism makes it harder with NFAs to decidewhen a string is accepted or not; whereas such a decision is ratherstraightforward with DFAs: recall their transition function is a\emph{function} that returns a single state. So with DFAs we do nothave to search at all. What is perhaps interesting is the fact thatfor every NFA we can find a DFA that also recognises the samelanguage. This might sound a bit paradoxical: NFA $\rightarrow$decision of acceptance hard; DFA $\rightarrow$ decision easy. But this\emph{is} true\ldots but of course there is always a caveat---nothingever is for free in life.There are actually a number of methods for transforming a NFA intoan equivalent DFA, but the most famous one is the \emph{subset construction}. 
Consider the following NFA where the states arelabelled with $0$, $1$ and $2$.\begin{center}\begin{tabular}{c@{\hspace{10mm}}c}\begin{tikzpicture}[scale=0.7,>=stealth',very thick, every state/.style={minimum size=0pt, draw=blue!50,very thick,fill=blue!20}, baseline=(current bounding box.center)]\node[state,initial] (Q_0) {$0$};\node[state] (Q_1) [below=of Q_0] {$1$};\node[state, accepting] (Q_2) [below=of Q_1] {$2$};\path[->] (Q_0) edge node [right] {$b$} (Q_1);\path[->] (Q_1) edge node [right] {$a,b$} (Q_2);\path[->] (Q_0) edge [loop above] node {$a, b$} ();\end{tikzpicture}&\begin{tabular}{r|ll}states & $a$ & $b$\\\hline$\{\}\phantom{\star}$ & $\{\}$ & $\{\}$\\start: $\{0\}\phantom{\star}$ & $\{0\}$ & $\{0,1\}$\\$\{1\}\phantom{\star}$ & $\{2\}$ & $\{2\}$\\$\{2\}\star$ & $\{\}$ & $\{\}$\\$\{0,1\}\phantom{\star}$ & $\{0,2\}$ & $\{0,1,2\}$\\$\{0,2\}\star$ & $\{0\}$ & $\{0,1\}$\\$\{1,2\}\star$ & $\{2\}$ & $\{2\}$\\$\{0,1,2\}\star$ & $\{0,2\}$ & $\{0,1,2\}$\\\end{tabular}\end{tabular}\end{center}\noindent The states of the corresponding DFA are given by generatingall subsets of the set $\{0,1,2\}$ (seen in the states columnin the table on the right). The other columns define the transitionfunction for the DFA for inputs $a$ and $b$. The first row states that$\{\}$ is the sink state which has transitions for $a$ and $b$ toitself. The next three lines are calculated as follows:\begin{itemize}\item Suppose you calculate the entry for the $a$-transition for state $\{0\}$. Look for all states in the NFA that can be reached by such a transition from this state; this is only state $0$; therefore from state $\{0\}$ we can go to state $\{0\}$ via an $a$-transition.\item Do the same for the $b$-transition; you can reach states $0$ and $1$ in the NFA; therefore in the DFA we can go from state $\{0\}$ to state $\{0,1\}$ via an $b$-transition.\item Continue with the states $\{1\}$ and $\{2\}$.\end{itemize}\noindentOnce you filled in the transitions for `simple' states $\{0\}$.. $\{2\}$, you only have to build the union for the compound states$\{0,1\}$, $\{0,2\}$ and so on. For example for $\{0,1\}$ you take theunion of Line $\{0\}$ and Line $\{1\}$, which gives $\{0,2\}$ for $a$,and $\{0,1,2\}$ for $b$. And so on.The starting state of the DFA can be calculated from the startingstates of the NFA, that is in this case $\{0\}$. But in general therecan of course be many starting states in the NFA and you would takethe corresponding subset as \emph{the} starting state of the DFA.The accepting states in the DFA are given by all sets that contain a$2$, which is the only accpting state in this NFA. But again ingeneral if the subset contains any accepting state from the NFA, thenthe corresponding state in the DFA is accepting as well. Thiscompletes the subset construction. 
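The table can also be written down as a concrete Scala \texttt{Map}
(just a sketch for cross-checking the entries above; the numbers $0$,
$1$ and $2$ stand for the NFA states):

{\small\begin{lstlisting}[language=Scala]
// the DFA transition table from above; DFA states are sets of NFA states
val dfa_table : Map[(Set[Int], Char), Set[Int]] = Map(
  (Set[Int](),   'a') -> Set[Int](),  (Set[Int](),   'b') -> Set[Int](),
  (Set(0),       'a') -> Set(0),      (Set(0),       'b') -> Set(0, 1),
  (Set(1),       'a') -> Set(2),      (Set(1),       'b') -> Set(2),
  (Set(2),       'a') -> Set[Int](),  (Set(2),       'b') -> Set[Int](),
  (Set(0, 1),    'a') -> Set(0, 2),   (Set(0, 1),    'b') -> Set(0, 1, 2),
  (Set(0, 2),    'a') -> Set(0),      (Set(0, 2),    'b') -> Set(0, 1),
  (Set(1, 2),    'a') -> Set(2),      (Set(1, 2),    'b') -> Set(2),
  (Set(0, 1, 2), 'a') -> Set(0, 2),   (Set(0, 1, 2), 'b') -> Set(0, 1, 2))
\end{lstlisting}}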
The corresponding DFA for the NFA
shown above is:

\begin{equation}
\begin{tikzpicture}[scale=0.8,>=stealth',very thick,
    every state/.style={minimum size=0pt,
    draw=blue!50,very thick,fill=blue!20},
    baseline=(current bounding box.center)]
\node[state,initial] (q0) {$0$};
\node[state] (q01) [right=of q0] {$0,1$};
\node[state,accepting] (q02) [below=of q01] {$0,2$};
\node[state,accepting] (q012) [right=of q02] {$0,1,2$};
\node[state] (q1) [below=0.5cm of q0] {$1$};
\node[state,accepting] (q2) [below=1cm of q1] {$2$};
\node[state] (qn) [below left=1cm of q2] {$\{\}$};
\node[state,accepting] (q12) [below right=1cm of q2] {$1,2$};
\path[->] (q0) edge node [above] {$b$} (q01);
\path[->] (q01) edge node [above] {$b$} (q012);
\path[->] (q0) edge [loop above] node {$a$} ();
\path[->] (q012) edge [loop right] node {$b$} ();
\path[->] (q012) edge node [below] {$a$} (q02);
\path[->] (q02) edge node [below] {$a$} (q0);
\path[->] (q01) edge [bend left] node [left] {$a$} (q02);
\path[->] (q02) edge [bend left] node [right] {$b$} (q01);
\path[->] (q1) edge node [left] {$a,b$} (q2);
\path[->] (q12) edge node [right] {$a, b$} (q2);
\path[->] (q2) edge node [right] {$a, b$} (qn);
\path[->] (qn) edge [loop left] node {$a,b$} ();
\end{tikzpicture}
\label{subsetdfa}
\end{equation}

\noindent
Please check that this is indeed a DFA. The big question is whether
this DFA can recognise the same language as the NFA we started with.
I let you ponder about this question.

There are also two points to note: One is that very often in the
subset construction the resulting DFA contains a number of ``dead''
states that are never reachable from the starting state. This is
obvious in the example, where the states $\{1\}$, $\{2\}$, $\{1,2\}$ and
$\{\}$ can never be reached from the starting state. But this might
not always be as obvious as that. In effect the DFA in this example is
not a \emph{minimal} DFA (more about this in a minute). Such dead
states can be safely removed without changing the language that is
recognised by the DFA. Another point is that in some cases, however,
the subset construction produces a DFA that does \emph{not} contain
any dead states\ldots{}this means it calculates a minimal DFA, which
in turn means that in some cases the number of states can increase
exponentially when going from a NFA to a DFA, namely to $2^n$ (which is
the number of subsets you can form for sets of $n$ states). This blow
up in the number of states in the DFA is again bad news for how
quickly you can decide whether a string is accepted by a DFA or
not. So the caveat with DFAs is that they might make the task of
finding the next state trivial, but might require exponentially more
states than the NFA we started with (up to $2^n$).\bigskip

\noindent
To conclude this section, how conveniently can we
implement the subset construction with our versions of NFAs and
DFAs? Very conveniently. The code is just:

{\small\begin{lstlisting}[language=Scala]
def subset[A, C](nfa: NFA[A, C]) : DFA[Set[A], C] = {
  DFA(nfa.starts, 
      { case (qs, c) => nfa.nexts(qs, c) }, 
      _.exists(nfa.fins))
}
\end{lstlisting}}

\noindent
The interesting point in this code is that the state type of the
calculated DFA is \texttt{Set[A]}. Think carefully about why this
works out correctly.

The DFA is then given by three components: the starting states, the
transition function and the accepting-states function. The starting
states are a set in the given NFA, but a single state in the DFA. The
transition function, given the state \texttt{qs} and input \texttt{c},
needs to produce the next state: this is the set of all NFA states
that are reachable from each state in \texttt{qs}.
The function \texttt{nexts} from the NFA class already calculates this
for us. The accepting-states function for the DFA is true whenever at
least one state in the subset is an accepting state in the
NFA.\medskip

\noindent
You might want to spend some quality time tinkering with this code and
pondering about it. Then you will probably notice that it is actually
a bit silly. The whole point of translating the NFA into a DFA via the
subset construction is to make the decision of whether a string is
accepted or not faster. Given the code above, the generated DFA will
be exactly as fast, or as slow, as the NFA we started with (actually
it will even be a tiny bit slower). The reason is that we just re-use
the \texttt{nexts} function from the NFA. This function implements the
non-deterministic breadth-first search. You might be thinking: This is
cheating! \ldots{} Well, not quite as you will see later, but in terms
of speed we still need to work a bit in order to get sometimes(!) a
faster DFA. Let's do this next.

\subsection*{DFA Minimisation}

As seen in \eqref{subsetdfa}, the subset construction from a NFA to a
DFA can result in a rather ``inefficient'' DFA, meaning there are
states that are not needed. There are two kinds of such unneeded
states: \emph{unreachable} states and \emph{non-distinguishable}
states. The first kind can just be removed without affecting the
language that is recognised (after all they are unreachable). The
second kind can also be recognised and thus a DFA can be
\emph{minimised} by the following algorithm:

\begin{enumerate}
\item Take all pairs $(q, p)$ with $q \not= p$.
\item Mark all pairs where one state is an accepting state and the
      other is a non-accepting state.
\item For all unmarked pairs $(q, p)$ and all characters $c$ test
      whether the pair
      \begin{center}
      $(\delta(q, c), \delta(p,c))$
      \end{center} 
      is marked. If it is for at least one $c$, then also mark $(q, p)$.
\item Repeat the last step until no change.
\item All unmarked pairs can be merged.
\end{enumerate}

\noindent Unfortunately, once we throw away all unreachable states in
\eqref{subsetdfa}, all remaining states are needed.
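Throwing away unreachable states can at least be sketched in code with
the same fixpoint idea we used for the $\epsilon$-closure. The helper
below is hypothetical: it assumes the DFA's transition function has
been tabulated as a \texttt{Map} (rather than the partial functions
used in our implementation) and that the alphabet is given explicitly.

{\small\begin{lstlisting}[language=Scala]
// hypothetical sketch: all states reachable from the start state
def reachable[A, C](start: A, delta: Map[(A, C), A], alphabet: Set[C]) : Set[A] = {
  def step(qs: Set[A]) : Set[A] = 
    qs ++ (for (q <- qs; c <- alphabet; p <- delta.get((q, c)).toSet) yield p)
  def fix(qs: Set[A]) : Set[A] = {
    val next = step(qs)
    if (next == qs) qs else fix(next)
  }
  fix(Set(start))
}
\end{lstlisting}}

\noindent
All states not in the result of \texttt{reachable} can be dropped
without changing the language the DFA recognises.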
In order toillustrate the minimisation algorithm, consider the following DFA.\begin{center}\begin{tikzpicture}[>=stealth',very thick,auto, every state/.style={minimum size=0pt, inner sep=2pt,draw=blue!50,very thick, fill=blue!20}]\node[state,initial] (Q_0) {$Q_0$};\node[state] (Q_1) [right=of Q_0] {$Q_1$};\node[state] (Q_2) [below right=of Q_0] {$Q_2$};\node[state] (Q_3) [right=of Q_2] {$Q_3$};\node[state, accepting] (Q_4) [right=of Q_1] {$Q_4$};\path[->] (Q_0) edge node [above] {$a$} (Q_1);\path[->] (Q_1) edge node [above] {$a$} (Q_4);\path[->] (Q_4) edge [loop right] node {$a, b$} ();\path[->] (Q_3) edge node [right] {$a$} (Q_4);\path[->] (Q_2) edge node [above] {$a$} (Q_3);\path[->] (Q_1) edge node [right] {$b$} (Q_2);\path[->] (Q_0) edge node [above] {$b$} (Q_2);\path[->] (Q_2) edge [loop left] node {$b$} ();\path[->] (Q_3) edge [bend left=95, looseness=1.3] node [below] {$b$} (Q_0);\end{tikzpicture}\end{center}\noindent In Step 1 and 2 we consider essentially a triangleof the form\begin{center}\begin{tikzpicture}[scale=0.6,line width=0.8mm]\draw (0,0) -- (4,0);\draw (0,1) -- (4,1);\draw (0,2) -- (3,2);\draw (0,3) -- (2,3);\draw (0,4) -- (1,4);\draw (0,0) -- (0, 4);\draw (1,0) -- (1, 4);\draw (2,0) -- (2, 3);\draw (3,0) -- (3, 2);\draw (4,0) -- (4, 1);\draw (0.5,-0.5) node {$Q_0$}; \draw (1.5,-0.5) node {$Q_1$}; \draw (2.5,-0.5) node {$Q_2$}; \draw (3.5,-0.5) node {$Q_3$};\draw (-0.5, 3.5) node {$Q_1$}; \draw (-0.5, 2.5) node {$Q_2$}; \draw (-0.5, 1.5) node {$Q_3$}; \draw (-0.5, 0.5) node {$Q_4$}; \draw (0.5,0.5) node {\large$\star$}; \draw (1.5,0.5) node {\large$\star$}; \draw (2.5,0.5) node {\large$\star$}; \draw (3.5,0.5) node {\large$\star$};\end{tikzpicture}\end{center}\noindent where the lower row is filled with stars, because inthe corresponding pairs there is always one state that isaccepting ($Q_4$) and a state that is non-accepting (the otherstates).In Step 3 we need to fill in more stars according whether one of the next-state pairs are marked. We have to do this for every unmarked field until there is no change anymore.This gives the triangle\begin{center}\begin{tikzpicture}[scale=0.6,line width=0.8mm]\draw (0,0) -- (4,0);\draw (0,1) -- (4,1);\draw (0,2) -- (3,2);\draw (0,3) -- (2,3);\draw (0,4) -- (1,4);\draw (0,0) -- (0, 4);\draw (1,0) -- (1, 4);\draw (2,0) -- (2, 3);\draw (3,0) -- (3, 2);\draw (4,0) -- (4, 1);\draw (0.5,-0.5) node {$Q_0$}; \draw (1.5,-0.5) node {$Q_1$}; \draw (2.5,-0.5) node {$Q_2$}; \draw (3.5,-0.5) node {$Q_3$};\draw (-0.5, 3.5) node {$Q_1$}; \draw (-0.5, 2.5) node {$Q_2$}; \draw (-0.5, 1.5) node {$Q_3$}; \draw (-0.5, 0.5) node {$Q_4$}; \draw (0.5,0.5) node {\large$\star$}; \draw (1.5,0.5) node {\large$\star$}; \draw (2.5,0.5) node {\large$\star$}; \draw (3.5,0.5) node {\large$\star$};\draw (0.5,1.5) node {\large$\star$}; \draw (2.5,1.5) node {\large$\star$}; \draw (0.5,3.5) node {\large$\star$}; \draw (1.5,2.5) node {\large$\star$}; \end{tikzpicture}\end{center}\noindent which means states $Q_0$ and $Q_2$, as well as $Q_1$and $Q_3$ can be merged. 
This gives the following minimal DFA\begin{center}\begin{tikzpicture}[>=stealth',very thick,auto, every state/.style={minimum size=0pt, inner sep=2pt,draw=blue!50,very thick, fill=blue!20}]\node[state,initial] (Q_02) {$Q_{0, 2}$};\node[state] (Q_13) [right=of Q_02] {$Q_{1, 3}$};\node[state, accepting] (Q_4) [right=of Q_13] {$Q_{4\phantom{,0}}$};\path[->] (Q_02) edge [bend left] node [above] {$a$} (Q_13);\path[->] (Q_13) edge [bend left] node [below] {$b$} (Q_02);\path[->] (Q_02) edge [loop below] node {$b$} ();\path[->] (Q_13) edge node [above] {$a$} (Q_4);\path[->] (Q_4) edge [loop above] node {$a, b$} ();\end{tikzpicture}\end{center}By the way, we are not bothering with implementing the aboveminimisation algorith: while up to now all the transformations usedsome clever composition of functions, the minimisation algorithmcannot be implemented by just composing some functions. For this wewould require a more concrete representation of the transitionfunction (like maps). If we did this, however, then many advantages ofthe functions would be thrown away. So the compromise is to not beingable to minimise (easily) our DFAs.\subsection*{Brzozowski's Method}I know this handout is already a long, long rant: but after all it isa topic that has been researched for more than 60 years. If youreflect on what you have read so far, the story is that you can take aregular expression, translate it via the Thompson Construction into an$\epsilon$NFA, then translate it into a NFA by removing all$\epsilon$-transitions, and then via the subset construction obtain aDFA. In all steps we made sure the language, or which strings can berecognised, stays the same. Of couse we should have proved this ineach step, but let us cut corners here. After the last section, wecan even minimise the DFA (maybe not in code). But again we made surethe same language is recognised. You might be wondering: Can we gointo the other direction? Can we go from a DFA and obtain a regularexpression that can recognise the same language as the DFA?\medskip\noindentThe answer is yes. Again there are several methods for calculating aregular expression for a DFA. I will show you Brzozowski's methodbecause it calculates a regular expression using quite familiartransformations for solving equational systems. Consider the DFA:\begin{center}\begin{tikzpicture}[scale=1.5,>=stealth',very thick,auto, every state/.style={minimum size=0pt, inner sep=2pt,draw=blue!50,very thick, fill=blue!20}] \node[state, initial] (q0) at ( 0,1) {$Q_0$}; \node[state] (q1) at ( 1,1) {$Q_1$}; \node[state, accepting] (q2) at ( 2,1) {$Q_2$}; \path[->] (q0) edge[bend left] node[above] {$a$} (q1) (q1) edge[bend left] node[above] {$b$} (q0) (q2) edge[bend left=50] node[below] {$b$} (q0) (q1) edge node[above] {$a$} (q2) (q2) edge [loop right] node {$a$} () (q0) edge [loop below] node {$b$} ();\end{tikzpicture}\end{center}\noindent for which we can set up the following equationalsystem\begin{eqnarray}Q_0 & = & \ONE + Q_0\,b + Q_1\,b + Q_2\,b\\Q_1 & = & Q_0\,a\\Q_2 & = & Q_1\,a + Q_2\,a\end{eqnarray}\noindent There is an equation for each node in the DFA. Letus have a look how the right-hand sides of the equations areconstructed. First have a look at the second equation: theleft-hand side is $Q_1$ and the right-hand side $Q_0\,a$. Theright-hand side is essentially all possible ways how to end upin node $Q_1$. There is only one incoming edge from $Q_0$ consumingan $a$. Therefore the right hand side is thisstate followed by character---in this case $Q_0\,a$. 
Now let's have a look at the third equation: there are two incoming
edges for $Q_2$. Therefore we have two terms, namely $Q_1\,a$ and
$Q_2\,a$. These terms are separated by $+$. The first states that
being in state $Q_1$ and consuming an $a$ will bring you to $Q_2$, and
the second that being in $Q_2$ and consuming an $a$ will make you stay
in $Q_2$. The right-hand side of the first equation is constructed
similarly: there are three incoming edges, therefore there are three
terms. There is one exception in that we also ``add'' $\ONE$ to the
first equation, because it corresponds to the starting state in the
DFA.

Having constructed the equational system, the question is how to solve
it. Remarkably the rules are very similar to solving usual linear
equational systems. For example the second equation does not contain
the variable $Q_1$ on the right-hand side of the equation. We can
therefore eliminate $Q_1$ from the system by just substituting this
equation into the other two. This gives

\begin{eqnarray}
Q_0 & = & \ONE + Q_0\,b + Q_0\,a\,b + Q_2\,b\\
Q_2 & = & Q_0\,a\,a + Q_2\,a
\end{eqnarray}

\noindent where in Equation (4) we have two occurrences of
$Q_0$. Using the usual laws about $+$ and $\cdot$, we can simplify
Equation (4) to obtain the following two equations:

\begin{eqnarray}
Q_0 & = & \ONE + Q_0\,(b + a\,b) + Q_2\,b\\
Q_2 & = & Q_0\,a\,a + Q_2\,a
\end{eqnarray}

\noindent Unfortunately we cannot make any more progress with
substituting equations, because both (6) and (7) contain the
variable on the left-hand side also on the right-hand side.
Here we now need to use a law that is different from the usual
laws about linear equations. It is called \emph{Arden's rule}.
It states that if an equation is of the form $q = q\,r + s$
then it can be transformed to $q = s\, r^*$. Since we can
assume $+$ is symmetric, Equation (7) is of that form: $s$ is
$Q_0\,a\,a$ and $r$ is $a$. That means we can transform
(7) to obtain the two new equations

\begin{eqnarray}
Q_0 & = & \ONE + Q_0\,(b + a\,b) + Q_2\,b\\
Q_2 & = & Q_0\,a\,a\,(a^*)
\end{eqnarray}

\noindent Now again we can substitute the second equation into
the first in order to eliminate the variable $Q_2$.

\begin{eqnarray}
Q_0 & = & \ONE + Q_0\,(b + a\,b) + Q_0\,a\,a\,(a^*)\,b
\end{eqnarray}

\noindent Pulling $Q_0$ out as a single factor gives:

\begin{eqnarray}
Q_0 & = & \ONE + Q_0\,(b + a\,b + a\,a\,(a^*)\,b)
\end{eqnarray}

\noindent This equation is again of the right form so that we can
apply Arden's rule ($r$ is $b + a\,b + a\,a\,(a^*)\,b$ and $s$
is $\ONE$). This gives as solution for $Q_0$ the following
regular expression:

\begin{eqnarray}
Q_0 & = & \ONE\,(b + a\,b + a\,a\,(a^*)\,b)^*
\end{eqnarray}

\noindent Since this is a regular expression, we can simplify
away the $\ONE$ to obtain the slightly simpler regular
expression

\begin{eqnarray}
Q_0 & = & (b + a\,b + a\,a\,(a^*)\,b)^*
\end{eqnarray}

\noindent Now we can unwind this process and obtain the solutions
for the other equations. This gives:

\begin{eqnarray}
Q_0 & = & (b + a\,b + a\,a\,(a^*)\,b)^*\\
Q_1 & = & (b + a\,b + a\,a\,(a^*)\,b)^*\,a\\
Q_2 & = & (b + a\,b + a\,a\,(a^*)\,b)^*\,a\,a\,(a)^*
\end{eqnarray}

\noindent Finally, we only need to ``add'' up the equations
which correspond to a terminal state. In our running example,
this is just $Q_2$. Consequently, a regular expression
that recognises the same language as the DFA is

\[
(b + a\,b + a\,a\,(a^*)\,b)^*\,a\,a\,(a)^*
\]

\noindent You can somewhat crosscheck your solution by taking a string
the regular expression can match and seeing whether it can be matched
by the DFA.
We should prove that Brzozowski's method really produces an equivalent
regular expression. But for the purposes of this module, we omit
this. I guess you are relieved.

\subsection*{Regular Languages}

Given the constructions in the previous sections, we obtain the
following overall picture:

\begin{center}
\begin{tikzpicture}
\node (rexp) {\bf Regexps};
\node (nfa)  [right=of rexp] {\bf NFAs};
\node (dfa)  [right=of nfa]  {\bf DFAs};
\node (mdfa) [right=of dfa]  {\bf\begin{tabular}{c}minimal\\ DFAs\end{tabular}};
\path[->,line width=1mm] (rexp) edge node [above=4mm, black]
  {\begin{tabular}{c@{\hspace{9mm}}}Thompson's\\[-1mm] construction\end{tabular}} (nfa);
\path[->,line width=1mm] (nfa) edge node [above=4mm, black]
  {\begin{tabular}{c}subset\\[-1mm] construction\end{tabular}} (dfa);
\path[->,line width=1mm] (dfa) edge node [below=5mm, black] {minimisation} (mdfa);
\path[->,line width=1mm] (dfa) edge [bend left=45] node [below]
  {\begin{tabular}{l}Brzozowski's\\ method\end{tabular}} (rexp);
\end{tikzpicture}
\end{center}

\noindent By going from regular expressions over NFAs to DFAs, we can
always ensure that for every regular expression there exists a NFA and
a DFA that can recognise the same language (although we did not prove
this fact). Similarly, by going from DFAs to regular expressions, we
can make sure that for every DFA there exists a regular expression
that can recognise the same language (again without proof). The
fundamental conclusion we can draw is that automata and regular
expressions can recognise the same set of languages:

\begin{quote} A language is \emph{regular} iff there exists a
regular expression that recognises all its strings.
\end{quote}

\noindent or equivalently

\begin{quote} A language is \emph{regular} iff there exists an
automaton that recognises all its strings.
\end{quote}

\noindent Note that this is not a statement about a particular
language (that is, a particular set of strings), but about a large
class of languages, namely the regular ones.

As a consequence, for deciding whether a string is recognised by a
regular expression, we could use our algorithm based on derivatives,
or NFAs, or DFAs. But let us quickly look at what the differences mean
in computational terms. Translating a regular expression into a NFA
gives us an automaton that has $O(n)$ states---that means the size of
the NFA grows linearly with the size of the regular expression. The
problem with NFAs is that deciding whether a string is accepted or not
is computationally not cheap. Remember that with NFAs we have
potentially many next states even for the same input and also have the
silent $\epsilon$-transitions. If we want to find a path from the
starting state of a NFA to an accepting state, we need to consider all
possibilities. In Ruby, Python and Java this is done by a depth-first
search, which in turn means that if a ``wrong'' choice is made, the
algorithm has to backtrack and thus explore all potential
candidates. This is exactly the reason why Ruby, Python and Java are
so slow for evil regular expressions. An alternative to the
potentially slow depth-first search is to explore the search space in
a breadth-first fashion, but this might incur a big memory penalty.
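\noindent To illustrate the breadth-first alternative, below is a
minimal sketch (again not part of the accompanying code) of simulating
a NFA by keeping track of the \emph{set} of all states it could
currently be in: each input character maps the whole set to the set of
all possible successor states, so no backtracking is needed. The NFA
is an ad-hoc example for the regular expression $(a + b)^*\,a\,b$; its
representation as a set of triples and all the names are choices made
only for this illustration.

\begin{lstlisting}[numbers=none]
type State = Int
// a transition is (source, label, target);
// the label None stands for an epsilon-transition
type NTrans = Set[(State, Option[Char], State)]

// example NFA for (a + b)* a b : states 0, 1, 2 with 2 accepting
val trans: NTrans = Set(
  (0, Some('a'), 0), (0, Some('b'), 0),
  (0, Some('a'), 1), (1, Some('b'), 2))
val starts = Set(0)
val fins = Set(2)

// add all states reachable via epsilon-transitions
def ecl(qs: Set[State]): Set[State] = {
  val more = trans.collect { case (p, None, q) if qs(p) => q }
  if (more.subsetOf(qs)) qs else ecl(qs ++ more)
}

// one breadth-first step: all successors of the current set under c
def step(qs: Set[State], c: Char): Set[State] =
  ecl(trans.collect { case (p, Some(d), q) if qs(p) && d == c => q })

def accepts(s: List[Char]): Boolean =
  (s.foldLeft(ecl(starts))(step) & fins).nonEmpty

println(accepts("aab".toList))   // true
println(accepts("aba".toList))   // false
\end{lstlisting}

\noindent Note that each step now manipulates a whole set of states
rather than a single state; in effect this performs the subset
construction on the fly.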
To avoid the problems with NFAs, we can translate them into DFAs. With
DFAs the problem of deciding whether a string is recognised or not is
much simpler, because in each state it is completely determined what
the next state will be for a given input. So no search is needed. The
problem with this is that the translation to DFAs can increase the
number of states exponentially. Therefore when this route is taken, we
definitely need to minimise the resulting DFAs in order to have an
acceptable memory and runtime behaviour. But remember that the subset
construction in the worst case produces a DFA with $2^n$
states. Effectively also the translation to DFAs can incur a big
runtime penalty.

But this does not mean that everything is bad with automata. Recall
the problem of finding a regular expression for the language that is
\emph{not} recognised by a given regular expression. In our
implementation we explicitly added such regular expressions because
they are useful for recognising comments. But in principle we did not
need to. The argument for this is as follows: take a regular
expression, translate it into a NFA and then a DFA that both recognise
the same language. Once you have the DFA it is very easy to construct
the automaton for the language not recognised by the DFA. If the DFA
is complete, meaning its transition function is defined everywhere
(this is important!), then you just need to exchange the accepting and
non-accepting states. You can then translate this DFA back into a
regular expression and that will be the regular expression that can
match all strings the original regular expression could \emph{not}
match.

It is also interesting that not all languages are regular. The most
well-known example of a language that is not regular consists of all
the strings of the form

\[a^n\,b^n\]

\noindent meaning strings that consist of some number of $a$s followed
by the same number of $b$s. You can try, but you will not find a
regular expression for this language, nor an automaton. One can
actually prove that there is no regular expression nor automaton for
this language, but again that would lead us too far afield for what we
want to do in this module.

\subsection*{Where Have Derivatives Gone?}

Still to be done

\bigskip

%\noindent
%By now you are probably fed up with this text. It is now way too long
%for one lecture, but there is still one aspect of the
%automata-regular-expression-connection I like to describe. Perhaps by
%now you are asking yourself: Where have the derivatives gone? Did we
%just forget them? Well, they have a place in the picture of
%calculating a DFA from the regular expression.

%To be done
%
%\begin{center}
%\begin{tikzpicture}
%  [level distance=25mm,very thick,auto,
%   level 1/.style={sibling distance=30mm},
%   level 2/.style={sibling distance=15mm},
%   every node/.style={minimum size=30pt,
%   inner sep=0pt,circle,draw=blue!50,very thick,
%   fill=blue!20}]
%  \node {$r$} [grow=right]
%  child[->] {node (cn) {$d_{c_n}$}
%    child { node {$dd_{c_nc_n}$}}
%    child { node {$dd_{c_nc_1}$}}
%    %edge from parent node[left] {$c_n$}
%  }
%  child[->] {node (c1) {$d_{c_1}$}
%    child { node {$dd_{c_1c_n}$}}
%    child { node {$dd_{c_1c_1}$}}
%    %edge from parent node[left] {$c_1$}
%  };
%  %
%%\draw (cn) -- (c1) node {\vdots};
%\end{tikzpicture}
%\end{center}

%\section*{Further Reading}

%Compare what a ``human expert'' would create as an automaton for the
%regular expression $a\cdot (b + c)^*$ and what the Thomson
%algorithm generates.

%http://www.inf.ed.ac.uk/teaching/courses/ct/

\end{document}

%%% Local Variables:
%%% mode: latex
%%% TeX-master: t
%%% End: