handouts/ho03.tex
changeset 491 d5776c6018f0
parent 490 4fee50f38305
child 492 39b7ff2cf1bc
equal deleted inserted replaced
490:4fee50f38305 491:d5776c6018f0
   463   transition function of $\epsilon$NFA takes as input an \texttt{Option[C]}.
   463   transition function of $\epsilon$NFA takes as input an \texttt{Option[C]}.
   464   \texttt{None} stands for an $\epsilon$-transition; \texttt{Some(c)}
   464   \texttt{None} stands for an $\epsilon$-transition; \texttt{Some(c)}
   465   for a ``proper'' transition consuming a character. The functions in
   465   for a ``proper'' transition consuming a character. The functions in
   466   Lines 18--26 calculate
   466   Lines 18--26 calculate
   467   all states reachable by one or more $\epsilon$-transition for a given
   467   all states reachable by one or more $\epsilon$-transition for a given
   468   set of states. The NFA is constructed in Lines 36--38.\label{enfa}}
   468   set of states. The NFA is constructed in Lines 36--38.
       
   469   Note the interesting commands in Lines 5 and 6: their purpose is
       
   470   to ensure that \texttt{fixpT} is the tail-recursive version of
       
   471   the fixpoint construction; otherwise we would quickly get a
       
   472   stack-overflow exception, even on small examples, due to limitations
       
   473   of the JVM.
       
   474   \label{enfa}}
   469 \end{figure}
   475 \end{figure}
   470 
   476 
   471 Also look carefully how the transitions of $\epsilon$NFAs are
   477 Also look carefully how the transitions of $\epsilon$NFAs are
   472 implemented. The additional possibility of performing silent
   478 implemented. The additional possibility of performing silent
   473 transitions is encoded by using \texttt{Option[C]} as the type for the
   479 transitions is encoded by using \texttt{Option[C]} as the type for the
   884 
   890 
   885 
   891 
   886 
   892 
   887 \subsection*{Subset Construction}
   893 \subsection*{Subset Construction}
   888 
   894 
   889 Of course, some developers of regular expression matchers are aware
   895 Of course, some developers of regular expression matchers are aware of
   890 of these problems with sluggish NFAs and try to address them. One
   896 these problems with sluggish NFAs and try to address them. One common
   891 common technique for this I like to show you in this section. It will
   897 technique for alleviating the problem I like to show you in this
   892 also explain why we insisted on polymorphic types in our DFA code
   898 section. This will also explain why we insisted on polymorphic types in
   893 (remember I used \texttt{A} and \texttt{C} for the types of states and
   899 our DFA code (remember I used \texttt{A} and \texttt{C} for the types
   894 the input, see Figure~\ref{dfa} on Page~\pageref{dfa}).\bigskip
   900 of states and the input, see Figure~\ref{dfa} on
       
   901 Page~\pageref{dfa}).\bigskip
   895 
   902 
   896 \noindent
   903 \noindent
   897 To start, remember that we did not bother with defining and
   904 To start remember that we did not bother with defining and
   898 implementing $\epsilon$NFA; we immediately translated them into
   905 implementing $\epsilon$NFAs: we immediately translated them into
   899 equivalent NFAs. Equivalent in the sense of accepting the same
   906 equivalent NFAs. Equivalent in the sense of accepting the same
   900 language (though we only claimed this and did not prove it
   907 language (though we only claimed this and did not prove it
   901 rigorously). Remember also that NFAs have non-deterministic
   908 rigorously). Remember also that NFAs have non-deterministic
   902 transitions defined as a relation or implemented as function returning
   909 transitions defined as a relation or implemented as function returning
   903 sets of states.  This non-determinism is crucial for the Thompson
   910 sets of states.  This non-determinism is crucial for the Thompson
   904 Construction to work (recall the cases for $\cdot$, $+$ and
   911 Construction to work (recall the cases for $\cdot$, $+$ and
   905 ${}^*$). But this non-determinism makes it harder with NFAs to decide
   912 ${}^*$). But this non-determinism makes it harder with NFAs to decide
   906 when a string is accepted or not; such a decision is rather
   913 when a string is accepted or not; whereas such a decision is rather
   907 straightforward with DFAs: recall their transition function is a
   914 straightforward with DFAs: recall their transition function is a
   908 \emph{function} that returns a single state. So we do not have to
   915 \emph{function} that returns a single state. So with DFAs we do not
   909 search at all.  What is perhaps interesting is the fact that for every
   916 have to search at all.  What is perhaps interesting is the fact that
   910 NFA we can find a DFA that also recognises the same language. This
   917 for every NFA we can find a DFA that also recognises the same
   911 might sound a bit paradoxical: NFA $\rightarrow$ decision of
   918 language. This might sound a bit paradoxical: NFA $\rightarrow$
   912 acceptance hard; DFA $\rightarrow$ decision easy. But this \emph{is}
   919 decision of acceptance hard; DFA $\rightarrow$ decision easy. But this
   913 true\ldots but of course there is always a caveat---nothing is ever
   920 \emph{is} true\ldots but of course there is always a caveat---nothing
   914 for free in life.
   921 ever is for free in life.
   915 
   922 
   916 There are a number of techniques for transforming a NFA into an
   923 There are actually a number of methods for transforming a NFA into
   917 equivalent DFA, but the most famous one is the \emph{subset
   924 an equivalent DFA, but the most famous one is the \emph{subset
   918   construction}. Consider the following NFA where the states are
   925   construction}. Consider the following NFA where the states are
   919 labelled with, say, $0$, $1$ and $2$.
   926 labelled with $0$, $1$ and $2$.
   920 
   927 
   921 \begin{center}
   928 \begin{center}
   922 \begin{tabular}{c@{\hspace{10mm}}c}
   929 \begin{tabular}{c@{\hspace{10mm}}c}
   923 \begin{tikzpicture}[scale=0.7,>=stealth',very thick,
   930 \begin{tikzpicture}[scale=0.7,>=stealth',very thick,
   924                     every state/.style={minimum size=0pt,
   931                     every state/.style={minimum size=0pt,
   947 \end{tabular}
   954 \end{tabular}
   948 \end{tabular}
   955 \end{tabular}
   949 \end{center}
   956 \end{center}
   950 
   957 
   951 \noindent The states of the corresponding DFA are given by generating
   958 \noindent The states of the corresponding DFA are given by generating
   952 all subsets of the set of states of the NFA (seen in the states column
   959 all subsets of the set $\{0,1,2\}$ (seen in the states column
   953 in the table on the right). The other columns define the transition
   960 in the table on the right). The other columns define the transition
   954 function for the DFA for input $a$ and $b$. The first row states that
   961 function for the DFA for inputs $a$ and $b$. The first row states that
   955 $\{\}$ is the sink state which has transitions for $a$ and $b$ to
   962 $\{\}$ is the sink state which has transitions for $a$ and $b$ to
   956 itself.  The next three lines are calculated as follows:
   963 itself.  The next three lines are calculated as follows:
   957 
   964 
   958 \begin{itemize}
   965 \begin{itemize}
   959 \item Suppose you calculate the entry for the $a$-transition for state
   966 \item Suppose you calculate the entry for the $a$-transition for state
   962   state $\{0\}$ we can go to state $\{0\}$ via an $a$-transition.
   969   state $\{0\}$ we can go to state $\{0\}$ via an $a$-transition.
   963 \item Do the same for the $b$-transition; you can reach states $0$ and
   970 \item Do the same for the $b$-transition; you can reach states $0$ and
   964   $1$ in the NFA; therefore in the DFA we can go from state $\{0\}$ to
   971   $1$ in the NFA; therefore in the DFA we can go from state $\{0\}$ to
   965   state $\{0,1\}$ via an $b$-transition.
   972   state $\{0,1\}$ via an $b$-transition.
   966 \item Continue with the states $\{1\}$ and $\{2\}$.
   973 \item Continue with the states $\{1\}$ and $\{2\}$.
   967 \item Once you filled in the transitions for `simple' state, you only
       
   968   have to build the union for the compound states $\{0,1\}$, $\{0,2\}$
       
   969   and so on. For example for $\{0,1\}$ you take the union of line
       
   970   $\{0\}$ and line $\{1\}$, which gives $\{0,2\}$ for $a$, and
       
   971   $\{0,1,2\}$ for $b$. And so on.
       
   972 \item The starting state of the DFA can be calculated from the
       
   973   starting states of the NFA, that is in this case $0$. But in general
       
   974   there can be many starting states in the NFA and you would take the
       
   975   corresponding subset as \emph{the} starting state of the DFA.
       
   976 \item The accepting states in the DFA are given by all sets that
       
   977   contain a $2$, which is the only accpting state in this NFA. But
       
   978   again in general if the subset contains an accepting state from the
       
   979   NFA, then the corresponding state in the DFA is accepting as well.
       
   980 \end{itemize}
   974 \end{itemize}
   981 
   975 
   982 \noindent This completes the subset construction. The corresponding
   976 \noindent
   983 DFA for the NFA shown above is:
   977 Once you filled in the transitions for `simple' states $\{0\}$
   984 
   978 .. $\{2\}$, you only have to build the union for the compound states
   985 \begin{center}
   979 $\{0,1\}$, $\{0,2\}$ and so on. For example for $\{0,1\}$ you take the
       
   980 union of Line $\{0\}$ and Line $\{1\}$, which gives $\{0,2\}$ for $a$,
       
   981 and $\{0,1,2\}$ for $b$. And so on.
       
   982 
       
   983 The starting state of the DFA can be calculated from the starting
       
   984 states of the NFA, that is in this case $\{0\}$. But in general there
       
   985 can of course be many starting states in the NFA and you would take
       
   986 the corresponding subset as \emph{the} starting state of the DFA.
       
   987 
       
   988 The accepting states in the DFA are given by all sets that contain a
       
   989 $2$, which is the only accpting state in this NFA. But again in
       
   990 general if the subset contains any accepting state from the NFA, then
       
   991 the corresponding state in the DFA is accepting as well.  This
       
   992 completes the subset construction. The corresponding DFA for the NFA
       
   993 shown above is:
       
   994 
       
   995 \begin{equation}
   986 \begin{tikzpicture}[scale=0.8,>=stealth',very thick,
   996 \begin{tikzpicture}[scale=0.8,>=stealth',very thick,
   987                     every state/.style={minimum size=0pt,
   997                     every state/.style={minimum size=0pt,
   988                     draw=blue!50,very thick,fill=blue!20},
   998                       draw=blue!50,very thick,fill=blue!20},
   989                     baseline=0mm]
   999                     baseline=(current bounding box.center)]
   990 \node[state,initial]  (q0)  {$0$};
  1000 \node[state,initial]  (q0)  {$0$};
   991 \node[state] (q01) [right=of q0] {$0,1$};
  1001 \node[state] (q01) [right=of q0] {$0,1$};
   992 \node[state,accepting] (q02) [below=of q01] {$0,2$};
  1002 \node[state,accepting] (q02) [below=of q01] {$0,2$};
   993 \node[state,accepting] (q012) [right=of q02] {$0,1,2$};
  1003 \node[state,accepting] (q012) [right=of q02] {$0,1,2$};
   994 \node[state] (q1) [below=0.5cm of q0] {$1$};
  1004 \node[state] (q1) [below=0.5cm of q0] {$1$};
  1006 \path[->] (q02) edge [bend left] node [right]  {$b$} (q01);
  1016 \path[->] (q02) edge [bend left] node [right]  {$b$} (q01);
  1007 \path[->] (q1) edge node [left] {$a,b$} (q2);
  1017 \path[->] (q1) edge node [left] {$a,b$} (q2);
  1008 \path[->] (q12) edge node [right] {$a, b$} (q2);
  1018 \path[->] (q12) edge node [right] {$a, b$} (q2);
  1009 \path[->] (q2) edge node [right] {$a, b$} (qn);
  1019 \path[->] (q2) edge node [right] {$a, b$} (qn);
  1010 \path[->] (qn) edge [loop left] node  {$a,b$} ();
  1020 \path[->] (qn) edge [loop left] node  {$a,b$} ();
  1011 \end{tikzpicture}
  1021 \end{tikzpicture}\label{subsetdfa}
  1012 \end{center}
  1022 \end{equation}
  1013 
  1023 
  1014 \noindent
  1024 \noindent
  1015 Please check that this is indeed a DFA. The big question is whether
  1025 Please check that this is indeed a DFA. The big question is whether
  1016 this DFA can recognise the same language as the NFA we started with.
  1026 this DFA can recognise the same language as the NFA we started with?
  1017 I let you ponder about this question.
  1027 I let you ponder about this question.
  1018 
  1028 
  1019 
  1029 
  1020 There are also two points to note: One is that very often the
  1030 There are also two points to note: One is that very often in the
  1021 resulting DFA contains a number of ``dead'' states that are never
  1031 subset construction the resulting DFA contains a number of ``dead''
  1022 reachable from the starting state. This is obvious in this case, where
  1032 states that are never reachable from the starting state. This is
  1023 state $\{1\}$, $\{2\}$, $\{1,2\}$ and $\{\}$ can never be reached from
  1033 obvious in the example, where state $\{1\}$, $\{2\}$, $\{1,2\}$ and
  1024 the starting state.  In effect the DFA in this example is not a
  1034 $\{\}$ can never be reached from the starting state. But this might
  1025 \emph{minimal} DFA (more about this in a minute). Such dead states can
  1035 not always be as obvious as that. In effect the DFA in this example is
  1026 be safely removed without changing the language that is recognised by
  1036 not a \emph{minimal} DFA (more about this in a minute). Such dead
  1027 the DFA. Another point is that in some cases, however, the subset
  1037 states can be safely removed without changing the language that is
  1028 construction produces a DFA that does \emph{not} contain any dead
  1038 recognised by the DFA. Another point is that in some cases, however,
  1029 states\ldots{}and further calculates a minimal DFA. Which in turn
  1039 the subset construction produces a DFA that does \emph{not} contain
  1030 means that in some cases the number of states can by going from NFAs
  1040 any dead states\ldots{}this means it calculates a minimal DFA. Which
  1031 to DFAs exponentially increase, namely by $2^n$ (which is the number
  1041 in turn means that in some cases the number of states can by going
  1032 of subsets you can form for $n$ states).  This blow up the number of
  1042 from NFAs to DFAs exponentially increase, namely by $2^n$ (which is
  1033 states in the DFA is again bad news for how quickly you can decide
  1043 the number of subsets you can form for sets of $n$ states).  This blow
  1034 whether a string is accepted by a DFA or not. So the caveat with DFAs
  1044 up in the number of states in the DFA is again bad news for how
  1035 is that they might make the task of finding the next state trival, but
  1045 quickly you can decide whether a string is accepted by a DFA or
  1036 might require $2^n$ times as many states as a NFA.\bigskip
  1046 not. So the caveat with DFAs is that they might make the task of
  1037 
  1047 finding the next state trival, but might require $2^n$ times as many
  1038 Lastly, can we 
  1048 states then a NFA.\bigskip
       
  1049 
       
  1050 \noindent
       
  1051 To conclude this section, how conveniently we can
       
  1052 implement the subset construction with our versions of NFAs and
       
  1053 DFAs? Very conveninetly. The code is just:
  1039 
  1054 
  1040 {\small\begin{lstlisting}[language=Scala]
  1055 {\small\begin{lstlisting}[language=Scala]
  1041 def subset[A, C](nfa: NFA[A, C]) : DFA[Set[A], C] = {
  1056 def subset[A, C](nfa: NFA[A, C]) : DFA[Set[A], C] = {
  1042   DFA(nfa.starts, 
  1057   DFA(nfa.starts, 
  1043       { case (qs, c) => nfa.nexts(qs, c) }, 
  1058       { case (qs, c) => nfa.nexts(qs, c) }, 
  1044       _.exists(nfa.fins))
  1059       _.exists(nfa.fins))
  1045 }
  1060 }
  1046 \end{lstlisting}}  
  1061 \end{lstlisting}}  
  1047 
  1062 
  1048 
  1063 \noindent
       
  1064 The interesting point in this code is that the state type of the
       
  1065 calculated DFA is \texttt{Set[A]}. Think carefully that this works out
       
  1066 correctly.
       
  1067 
       
  1068 The DFA is then given by three components: the starting states, the
       
  1069 transition function and the accepting-states function.  The starting
       
  1070 states are a set in the given NFA, but a single state in the DFA.  The
       
  1071 transition function, given the state \texttt{qs} and input \texttt{c},
       
  1072 needs to produce the next state: this is the set of all NFA states
       
  1073 that are reachable from each state in \texttt{qs}. The function
       
  1074 \texttt{nexts} from the NFA class already calculates this for us. The
       
  1075 accepting-states function for the DFA is true henevner at least one
       
  1076 state in the subset is accepting (that is true) in the NFA.\medskip
       
  1077 
       
  1078 \noindent
       
  1079 You might be able to spend some quality tinkering with this code and
       
  1080 time to ponder. Then you will probably notice it is actually
       
  1081 silly. The whole point of translating the NFA into a DFA via the
       
  1082 subset construction is to make the decision of whether a string is
       
  1083 accepted or not faster. Given the code above, the generated DFA will
       
  1084 be exactly as fast, or as slow, as the NFA we started with (actually
       
  1085 it will even be a tiny bit slower). The reason is that we just re-use
       
  1086 the \texttt{nexts} function from the NFA. This fucntion implements the
       
  1087 non-deterministic breadth-first search.  You might be thinking: That
       
  1088 is cheating! \ldots{} Well, not quite as you will see later, but in
       
  1089 terms of speed we still need to work a bit in order to get
       
  1090 sometimes(!) a faster DFA. Let's do this next.
  1049 
  1091 
  1050 \subsection*{DFA Minimisation}
  1092 \subsection*{DFA Minimisation}
  1051 
  1093 
  1052 As seen in the subset construction, the translation 
  1094 As seen in \eqref{subsetdfa}, the subset construction from NFA to a
  1053 of a NFA to a DFA can result in a rather ``inefficient'' 
  1095 DFA can result in a rather ``inefficient'' DFA. Meaning there are
  1054 DFA. Meaning there are states that are not needed. A
  1096 states that are not needed. There are two kinds of such unneeded
  1055 DFA can be \emph{minimised} by the following algorithm:
  1097 states: \emph{unreachable} states and \emph{nondistinguishable}
       
  1098 states. The first kind of states can just be removed without affecting
       
  1099 the language that can be recognised (after all they are
       
  1100 unreachable). The second kind can also be recognised and thus a DFA
       
  1101 can be \emph{minimised} by the following algorithm:
  1056 
  1102 
  1057 \begin{enumerate}
  1103 \begin{enumerate}
  1058 \item Take all pairs $(q, p)$ with $q \not= p$
  1104 \item Take all pairs $(q, p)$ with $q \not= p$
  1059 \item Mark all pairs that accepting and non-accepting states
  1105 \item Mark all pairs that accepting and non-accepting states
  1060 \item For all unmarked pairs $(q, p)$ and all characters $c$
  1106 \item For all unmarked pairs $(q, p)$ and all characters $c$
  1067       are marked. If there is one, then also mark $(q, p)$.
  1113       are marked. If there is one, then also mark $(q, p)$.
  1068 \item Repeat last step until no change.
  1114 \item Repeat last step until no change.
  1069 \item All unmarked pairs can be merged.
  1115 \item All unmarked pairs can be merged.
  1070 \end{enumerate}
  1116 \end{enumerate}
  1071 
  1117 
  1072 \noindent To illustrate this algorithm, consider the following 
  1118 \noindent Unfortunately, once we throw away all unreachable states in
  1073 DFA.
  1119 \eqref{subsetdfa}, all remaining states are needed.  In order to
       
  1120 illustrate the minimisation algorithm, consider the following DFA.
  1074 
  1121 
  1075 \begin{center}
  1122 \begin{center}
  1076 \begin{tikzpicture}[>=stealth',very thick,auto,
  1123 \begin{tikzpicture}[>=stealth',very thick,auto,
  1077                     every state/.style={minimum size=0pt,
  1124                     every state/.style={minimum size=0pt,
  1078                     inner sep=2pt,draw=blue!50,very thick,
  1125                     inner sep=2pt,draw=blue!50,very thick,
  1132 \noindent where the lower row is filled with stars, because in
  1179 \noindent where the lower row is filled with stars, because in
  1133 the corresponding pairs there is always one state that is
  1180 the corresponding pairs there is always one state that is
  1134 accepting ($Q_4$) and a state that is non-accepting (the other
  1181 accepting ($Q_4$) and a state that is non-accepting (the other
  1135 states).
  1182 states).
  1136 
  1183 
  1137 Now in Step 3 we need to fill in more stars according whether 
  1184 In Step 3 we need to fill in more stars according whether 
  1138 one of the next-state pairs are marked. We have to do this 
  1185 one of the next-state pairs are marked. We have to do this 
  1139 for every unmarked field until there is no change anymore.
  1186 for every unmarked field until there is no change anymore.
  1140 This gives the triangle
  1187 This gives the triangle
  1141 
  1188 
  1142 \begin{center}
  1189 \begin{center}
  1195 \end{center}
  1242 \end{center}
  1196 
  1243 
  1197 
  1244 
  1198 \subsection*{Brzozowski's Method}
  1245 \subsection*{Brzozowski's Method}
  1199 
  1246 
  1200 As said before, we can also go into the other direction---from
  1247 I know tyhis is already a long, long rant: but after all it is a topic
  1201 DFAs to regular expressions. Brzozowski's method calculates
  1248 that has been researched for more than 60 years. If you reflect on
  1202 a regular expression using familiar transformations for
  1249 what you have read so far, the story you can take a regular
  1203 solving equational systems. Consider the DFA:
  1250 expression, translate it via the Thompson Construction into an
       
  1251 $\epsilon$NFA, then translate it into a NFA by removing all
       
  1252 $\epsilon$-transitions, and then via the subset construction obtain a
       
  1253 DFA. In all steps we made sure the language, or which strings can be
       
  1254 recognised, stays the same. After the last section, we can even
       
  1255 minimise the DFA. But again we made sure the same language is
       
  1256 recognised. You might be wondering: Can we go into the other
       
  1257 direction?  Can we go from a DFA and obtain a regular expression that
       
  1258 can recognise the same language as the DFA?\medskip
       
  1259 
       
  1260 \noindent
       
  1261 The answer is yes. Again there are several methods for calculating a
       
  1262 regular expression for a DFA. I will show you Brzozowski's method
       
  1263 because it calculates a regular expression using quite familiar
       
  1264 transformations for solving equational systems. Consider the DFA:
  1204 
  1265 
  1205 \begin{center}
  1266 \begin{center}
  1206 \begin{tikzpicture}[scale=1.5,>=stealth',very thick,auto,
  1267 \begin{tikzpicture}[scale=1.5,>=stealth',very thick,auto,
  1207                     every state/.style={minimum size=0pt,
  1268                     every state/.style={minimum size=0pt,
  1208                     inner sep=2pt,draw=blue!50,very thick,
  1269                     inner sep=2pt,draw=blue!50,very thick,
  1327 \end{eqnarray}
  1388 \end{eqnarray}
  1328 
  1389 
  1329 \noindent Finally, we only need to ``add'' up the equations
  1390 \noindent Finally, we only need to ``add'' up the equations
  1330 which correspond to a terminal state. In our running example,
  1391 which correspond to a terminal state. In our running example,
  1331 this is just $Q_2$. Consequently, a regular expression
  1392 this is just $Q_2$. Consequently, a regular expression
  1332 that recognises the same language as the automaton is
  1393 that recognises the same language as the DFA is
  1333 
  1394 
  1334 \[
  1395 \[
  1335 (b + a\,b + a\,a\,(a^*)\,b)^*\,a\,a\,(a)^*
  1396 (b + a\,b + a\,a\,(a^*)\,b)^*\,a\,a\,(a)^*
  1336 \]
  1397 \]
  1337 
  1398 
  1338 \noindent You can somewhat crosscheck your solution
  1399 \noindent You can somewhat crosscheck your solution by taking a string
  1339 by taking a string the regular expression can match and
  1400 the regular expression can match and and see whether it can be matched
  1340 and see whether it can be matched by the automaton.
  1401 by the DFA.  One string for example is $aaa$ and \emph{voila} this
  1341 One string for example is $aaa$ and \emph{voila} this 
       
  1342 string is also matched by the automaton.
  1402 string is also matched by the automaton.
  1343 
  1403 
  1344 We should prove that Brzozowski's method really produces
  1404 We should prove that Brzozowski's method really produces an equivalent
  1345 an equivalent  regular expression for the automaton. But
  1405 regular expression. But for the purposes of this module, we omit
  1346 for the purposes of this module, we omit this.
  1406 this. I guess you are relieved.
  1347 
  1407 
  1348 
  1408 
  1349 \subsection*{Regular Languages}
  1409 \subsection*{Regular Languages}
  1350 
  1410 
  1351 Given the constructions in the previous sections we obtain 
  1411 Given the constructions in the previous sections we obtain 
  1370 Although we did not prove this fact. Similarly by going from
  1430 Although we did not prove this fact. Similarly by going from
  1371 DFAs to regular expressions, we can make sure for every DFA 
  1431 DFAs to regular expressions, we can make sure for every DFA 
  1372 there exists a regular expression that can recognise the same
  1432 there exists a regular expression that can recognise the same
  1373 language. Again we did not prove this fact. 
  1433 language. Again we did not prove this fact. 
  1374 
  1434 
  1375 The interesting conclusion is that automata and regular 
  1435 The fundamental conclusion we can draw is that automata and regular
  1376 expressions can recognise the same set of languages:
  1436 expressions can recognise the same set of languages:
  1377 
  1437 
  1378 \begin{quote} A language is \emph{regular} iff there exists a
  1438 \begin{quote} A language is \emph{regular} iff there exists a
  1379 regular expression that recognises all its strings.
  1439 regular expression that recognises all its strings.
  1380 \end{quote}
  1440 \end{quote}
  1383 
  1443 
  1384 \begin{quote} A language is \emph{regular} iff there exists an
  1444 \begin{quote} A language is \emph{regular} iff there exists an
  1385 automaton that recognises all its strings.
  1445 automaton that recognises all its strings.
  1386 \end{quote}
  1446 \end{quote}
  1387 
  1447 
  1388 \noindent So for deciding whether a string is recognised by a
  1448 \noindent Note that this is not a stement for a particular language
  1389 regular expression, we could use our algorithm based on
  1449 (that is a particular set of strings), but about a large class of
  1390 derivatives or NFAs or DFAs. But let us quickly look at what
  1450 languages, namely the regular ones.
  1391 the differences mean in computational terms. Translating a
  1451 
  1392 regular expression into a NFA gives us an automaton that has
  1452 As a consequence for deciding whether a string is recognised by a
  1393 $O(n)$ states---that means the size of the NFA grows linearly
  1453 regular expression, we could use our algorithm based on derivatives or
  1394 with the size of the regular expression. The problem with NFAs
  1454 NFAs or DFAs. But let us quickly look at what the differences mean in
  1395 is that the problem of deciding whether a string is accepted
  1455 computational terms. Translating a regular expression into a NFA gives
  1396 or not is computationally not cheap. Remember with NFAs we
  1456 us an automaton that has $O(n)$ states---that means the size of the
  1397 have potentially many next states even for the same input and
  1457 NFA grows linearly with the size of the regular expression. The
  1398 also have the silent $\epsilon$-transitions. If we want to
  1458 problem with NFAs is that the problem of deciding whether a string is
  1399 find a path from the starting state of a NFA to an accepting
  1459 accepted or not is computationally not cheap. Remember with NFAs we
  1400 state, we need to consider all possibilities. In Ruby and
  1460 have potentially many next states even for the same input and also
  1401 Python this is done by a depth-first search, which in turn
  1461 have the silent $\epsilon$-transitions. If we want to find a path from
  1402 means that if a ``wrong'' choice is made, the algorithm has to
  1462 the starting state of a NFA to an accepting state, we need to consider
  1403 backtrack and thus explore all potential candidates. This is
  1463 all possibilities. In Ruby, Python and Java this is done by a
  1404 exactly the reason why Ruby and Python are so slow for evil
  1464 depth-first search, which in turn means that if a ``wrong'' choice is
  1405 regular expressions. An alternative to the potentially slow
  1465 made, the algorithm has to backtrack and thus explore all potential
  1406 depth-first search is to explore the search space in a
  1466 candidates. This is exactly the reason why Ruby, Python and Java are
  1407 breadth-first fashion, but this might incur a big memory
  1467 so slow for evil regular expressions. An alternative to the
  1408 penalty.  
  1468 potentially slow depth-first search is to explore the search space in
  1409 
  1469 a breadth-first fashion, but this might incur a big memory penalty.
  1410 To avoid the problems with NFAs, we can translate them 
  1470 
  1411 into DFAs. With DFAs the problem of deciding whether a
  1471 To avoid the problems with NFAs, we can translate them into DFAs. With
  1412 string is recognised or not is much simpler, because in
  1472 DFAs the problem of deciding whether a string is recognised or not is
  1413 each state it is completely determined what the next
  1473 much simpler, because in each state it is completely determined what
  1414 state will be for a given input. So no search is needed.
  1474 the next state will be for a given input. So no search is needed.  The
  1415 The problem with this is that the translation to DFAs
  1475 problem with this is that the translation to DFAs can explode
  1416 can explode exponentially the number of states. Therefore when 
  1476 exponentially the number of states. Therefore when this route is
  1417 this route is taken, we definitely need to minimise the
  1477 taken, we definitely need to minimise the resulting DFAs in order to
  1418 resulting DFAs in order to have an acceptable memory 
  1478 have an acceptable memory and runtime behaviour. But remember the
  1419 and runtime behaviour. But remember the subset construction
  1479 subset construction in the worst case explodes the number of states by
  1420 in the worst case explodes the number of states by $2^n$.
  1480 $2^n$.  Effectively also the translation to DFAs can incur a big
  1421 Effectively also the translation to DFAs can incur a big
       
  1422 runtime penalty.
  1481 runtime penalty.
  1423 
  1482 
  1424 But this does not mean that everything is bad with automata.
  1483 But this does not mean that everything is bad with automata.  Recall
  1425 Recall the problem of finding a regular expressions for the
  1484 the problem of finding a regular expressions for the language that is
  1426 language that is \emph{not} recognised by a regular
  1485 \emph{not} recognised by a regular expression. In our implementation
  1427 expression. In our implementation we added explicitly such a
  1486 we added explicitly such a regular expressions because they are useful
  1428 regular expressions because they are useful for recognising
  1487 for recognising comments. But in principle we did not need to. The
  1429 comments. But in principle we did not need to. The argument
  1488 argument for this is as follows: take a regular expression, translate
  1430 for this is as follows: take a regular expression, translate
       
  1431 it into a NFA and then a DFA that both recognise the same
  1489 it into a NFA and then a DFA that both recognise the same
  1432 language. Once you have the DFA it is very easy to construct
  1490 language. Once you have the DFA it is very easy to construct the
  1433 the automaton for the language not recognised by a DFA. If
  1491 automaton for the language not recognised by a DFA. If the DFA is
  1434 the DFA is completed (this is important!), then you just need
  1492 completed (this is important!), then you just need to exchange the
  1435 to exchange the accepting and non-accepting states. You can
  1493 accepting and non-accepting states. You can then translate this DFA
  1436 then translate this DFA back into a regular expression and
  1494 back into a regular expression and that will be the regular expression
  1437 that will be the regular expression that can match all strings
  1495 that can match all strings the original regular expression could
  1438 the original regular expression could \emph{not} match.
  1496 \emph{not} match.
  1439 
  1497 
  1440 It is also interesting that not all languages are regular. The
  1498 It is also interesting that not all languages are regular. The most
  1441 most well-known example of a language that is not regular
  1499 well-known example of a language that is not regular consists of all
  1442 consists of all the strings of the form
  1500 the strings of the form
  1443 
  1501 
  1444 \[a^n\,b^n\]
  1502 \[a^n\,b^n\]
  1445 
  1503 
  1446 \noindent meaning strings that have the same number of $a$s
  1504 \noindent meaning strings that have the same number of $a$s and
  1447 and $b$s. You can try, but you cannot find a regular
  1505 $b$s. You can try, but you cannot find a regular expression for this
  1448 expression for this language and also not an automaton. One
  1506 language and also not an automaton. One can actually prove that there
  1449 can actually prove that there is no regular expression nor
  1507 is no regular expression nor automaton for this language, but again
  1450 automaton for this language, but again that would lead us too
  1508 that would lead us too far afield for what we want to do in this
  1451 far afield for what we want to do in this module.
  1509 module.
  1452 
  1510 
  1453 %\section*{Further Reading}
  1511 %\section*{Further Reading}
  1454 
  1512 
  1455 %Compare what a ``human expert'' would create as an automaton for the
  1513 %Compare what a ``human expert'' would create as an automaton for the
  1456 %regular expression $a\cdot (b + c)^*$ and what the Thomson
  1514 %regular expression $a\cdot (b + c)^*$ and what the Thomson