884 |
890 |
885 |
891 |
886 |
892 |
887 \subsection*{Subset Construction} |
893 \subsection*{Subset Construction} |
888 |
894 |
889 Of course, some developers of regular expression matchers are aware |
895 Of course, some developers of regular expression matchers are aware of |
890 of these problems with sluggish NFAs and try to address them. One |
896 these problems with sluggish NFAs and try to address them. One common |
891 common technique for this I like to show you in this section. It will |
897 technique for alleviating the problem I like to show you in this |
892 also explain why we insisted on polymorphic types in our DFA code |
898 section. This will also explain why we insisted on polymorphic types in |
893 (remember I used \texttt{A} and \texttt{C} for the types of states and |
899 our DFA code (remember I used \texttt{A} and \texttt{C} for the types |
894 the input, see Figure~\ref{dfa} on Page~\pageref{dfa}).\bigskip |
900 of states and the input, see Figure~\ref{dfa} on |
|
901 Page~\pageref{dfa}).\bigskip |
895 |
902 |
896 \noindent |
903 \noindent |
897 To start, remember that we did not bother with defining and |
904 To start remember that we did not bother with defining and |
898 implementing $\epsilon$NFA; we immediately translated them into |
905 implementing $\epsilon$NFAs: we immediately translated them into |
899 equivalent NFAs. Equivalent in the sense of accepting the same |
906 equivalent NFAs. Equivalent in the sense of accepting the same |
900 language (though we only claimed this and did not prove it |
907 language (though we only claimed this and did not prove it |
901 rigorously). Remember also that NFAs have non-deterministic |
908 rigorously). Remember also that NFAs have non-deterministic |
902 transitions defined as a relation or implemented as function returning |
909 transitions defined as a relation or implemented as function returning |
903 sets of states. This non-determinism is crucial for the Thompson |
910 sets of states. This non-determinism is crucial for the Thompson |
904 Construction to work (recall the cases for $\cdot$, $+$ and |
911 Construction to work (recall the cases for $\cdot$, $+$ and |
905 ${}^*$). But this non-determinism makes it harder with NFAs to decide |
912 ${}^*$). But this non-determinism makes it harder with NFAs to decide |
906 when a string is accepted or not; such a decision is rather |
913 when a string is accepted or not; whereas such a decision is rather |
907 straightforward with DFAs: recall their transition function is a |
914 straightforward with DFAs: recall their transition function is a |
908 \emph{function} that returns a single state. So we do not have to |
915 \emph{function} that returns a single state. So with DFAs we do not |
909 search at all. What is perhaps interesting is the fact that for every |
916 have to search at all. What is perhaps interesting is the fact that |
910 NFA we can find a DFA that also recognises the same language. This |
917 for every NFA we can find a DFA that also recognises the same |
911 might sound a bit paradoxical: NFA $\rightarrow$ decision of |
918 language. This might sound a bit paradoxical: NFA $\rightarrow$ |
912 acceptance hard; DFA $\rightarrow$ decision easy. But this \emph{is} |
919 decision of acceptance hard; DFA $\rightarrow$ decision easy. But this |
913 true\ldots but of course there is always a caveat---nothing is ever |
920 \emph{is} true\ldots but of course there is always a caveat---nothing |
914 for free in life. |
921 ever is for free in life. |
915 |
922 |
916 There are a number of techniques for transforming a NFA into an |
923 There are actually a number of methods for transforming a NFA into |
917 equivalent DFA, but the most famous one is the \emph{subset |
924 an equivalent DFA, but the most famous one is the \emph{subset |
918 construction}. Consider the following NFA where the states are |
925 construction}. Consider the following NFA where the states are |
919 labelled with, say, $0$, $1$ and $2$. |
926 labelled with $0$, $1$ and $2$. |
920 |
927 |
921 \begin{center} |
928 \begin{center} |
922 \begin{tabular}{c@{\hspace{10mm}}c} |
929 \begin{tabular}{c@{\hspace{10mm}}c} |
923 \begin{tikzpicture}[scale=0.7,>=stealth',very thick, |
930 \begin{tikzpicture}[scale=0.7,>=stealth',very thick, |
924 every state/.style={minimum size=0pt, |
931 every state/.style={minimum size=0pt, |
962 state $\{0\}$ we can go to state $\{0\}$ via an $a$-transition. |
969 state $\{0\}$ we can go to state $\{0\}$ via an $a$-transition. |
963 \item Do the same for the $b$-transition; you can reach states $0$ and |
970 \item Do the same for the $b$-transition; you can reach states $0$ and |
964 $1$ in the NFA; therefore in the DFA we can go from state $\{0\}$ to |
971 $1$ in the NFA; therefore in the DFA we can go from state $\{0\}$ to |
965 state $\{0,1\}$ via an $b$-transition. |
972 state $\{0,1\}$ via an $b$-transition. |
966 \item Continue with the states $\{1\}$ and $\{2\}$. |
973 \item Continue with the states $\{1\}$ and $\{2\}$. |
967 \item Once you filled in the transitions for `simple' state, you only |
|
968 have to build the union for the compound states $\{0,1\}$, $\{0,2\}$ |
|
969 and so on. For example for $\{0,1\}$ you take the union of line |
|
970 $\{0\}$ and line $\{1\}$, which gives $\{0,2\}$ for $a$, and |
|
971 $\{0,1,2\}$ for $b$. And so on. |
|
972 \item The starting state of the DFA can be calculated from the |
|
973 starting states of the NFA, that is in this case $0$. But in general |
|
974 there can be many starting states in the NFA and you would take the |
|
975 corresponding subset as \emph{the} starting state of the DFA. |
|
976 \item The accepting states in the DFA are given by all sets that |
|
977 contain a $2$, which is the only accpting state in this NFA. But |
|
978 again in general if the subset contains an accepting state from the |
|
979 NFA, then the corresponding state in the DFA is accepting as well. |
|
980 \end{itemize} |
974 \end{itemize} |
981 |
975 |
982 \noindent This completes the subset construction. The corresponding |
976 \noindent |
983 DFA for the NFA shown above is: |
977 Once you filled in the transitions for `simple' states $\{0\}$ |
984 |
978 .. $\{2\}$, you only have to build the union for the compound states |
985 \begin{center} |
979 $\{0,1\}$, $\{0,2\}$ and so on. For example for $\{0,1\}$ you take the |
|
980 union of Line $\{0\}$ and Line $\{1\}$, which gives $\{0,2\}$ for $a$, |
|
981 and $\{0,1,2\}$ for $b$. And so on. |
|
982 |
|
983 The starting state of the DFA can be calculated from the starting |
|
984 states of the NFA, that is in this case $\{0\}$. But in general there |
|
985 can of course be many starting states in the NFA and you would take |
|
986 the corresponding subset as \emph{the} starting state of the DFA. |
|
987 |
|
988 The accepting states in the DFA are given by all sets that contain a |
|
989 $2$, which is the only accpting state in this NFA. But again in |
|
990 general if the subset contains any accepting state from the NFA, then |
|
991 the corresponding state in the DFA is accepting as well. This |
|
992 completes the subset construction. The corresponding DFA for the NFA |
|
993 shown above is: |
|
994 |
|
995 \begin{equation} |
986 \begin{tikzpicture}[scale=0.8,>=stealth',very thick, |
996 \begin{tikzpicture}[scale=0.8,>=stealth',very thick, |
987 every state/.style={minimum size=0pt, |
997 every state/.style={minimum size=0pt, |
988 draw=blue!50,very thick,fill=blue!20}, |
998 draw=blue!50,very thick,fill=blue!20}, |
989 baseline=0mm] |
999 baseline=(current bounding box.center)] |
990 \node[state,initial] (q0) {$0$}; |
1000 \node[state,initial] (q0) {$0$}; |
991 \node[state] (q01) [right=of q0] {$0,1$}; |
1001 \node[state] (q01) [right=of q0] {$0,1$}; |
992 \node[state,accepting] (q02) [below=of q01] {$0,2$}; |
1002 \node[state,accepting] (q02) [below=of q01] {$0,2$}; |
993 \node[state,accepting] (q012) [right=of q02] {$0,1,2$}; |
1003 \node[state,accepting] (q012) [right=of q02] {$0,1,2$}; |
994 \node[state] (q1) [below=0.5cm of q0] {$1$}; |
1004 \node[state] (q1) [below=0.5cm of q0] {$1$}; |
1006 \path[->] (q02) edge [bend left] node [right] {$b$} (q01); |
1016 \path[->] (q02) edge [bend left] node [right] {$b$} (q01); |
1007 \path[->] (q1) edge node [left] {$a,b$} (q2); |
1017 \path[->] (q1) edge node [left] {$a,b$} (q2); |
1008 \path[->] (q12) edge node [right] {$a, b$} (q2); |
1018 \path[->] (q12) edge node [right] {$a, b$} (q2); |
1009 \path[->] (q2) edge node [right] {$a, b$} (qn); |
1019 \path[->] (q2) edge node [right] {$a, b$} (qn); |
1010 \path[->] (qn) edge [loop left] node {$a,b$} (); |
1020 \path[->] (qn) edge [loop left] node {$a,b$} (); |
1011 \end{tikzpicture} |
1021 \end{tikzpicture}\label{subsetdfa} |
1012 \end{center} |
1022 \end{equation} |
1013 |
1023 |
1014 \noindent |
1024 \noindent |
1015 Please check that this is indeed a DFA. The big question is whether |
1025 Please check that this is indeed a DFA. The big question is whether |
1016 this DFA can recognise the same language as the NFA we started with. |
1026 this DFA can recognise the same language as the NFA we started with? |
1017 I let you ponder about this question. |
1027 I let you ponder about this question. |
1018 |
1028 |
1019 |
1029 |
1020 There are also two points to note: One is that very often the |
1030 There are also two points to note: One is that very often in the |
1021 resulting DFA contains a number of ``dead'' states that are never |
1031 subset construction the resulting DFA contains a number of ``dead'' |
1022 reachable from the starting state. This is obvious in this case, where |
1032 states that are never reachable from the starting state. This is |
1023 state $\{1\}$, $\{2\}$, $\{1,2\}$ and $\{\}$ can never be reached from |
1033 obvious in the example, where state $\{1\}$, $\{2\}$, $\{1,2\}$ and |
1024 the starting state. In effect the DFA in this example is not a |
1034 $\{\}$ can never be reached from the starting state. But this might |
1025 \emph{minimal} DFA (more about this in a minute). Such dead states can |
1035 not always be as obvious as that. In effect the DFA in this example is |
1026 be safely removed without changing the language that is recognised by |
1036 not a \emph{minimal} DFA (more about this in a minute). Such dead |
1027 the DFA. Another point is that in some cases, however, the subset |
1037 states can be safely removed without changing the language that is |
1028 construction produces a DFA that does \emph{not} contain any dead |
1038 recognised by the DFA. Another point is that in some cases, however, |
1029 states\ldots{}and further calculates a minimal DFA. Which in turn |
1039 the subset construction produces a DFA that does \emph{not} contain |
1030 means that in some cases the number of states can by going from NFAs |
1040 any dead states\ldots{}this means it calculates a minimal DFA. Which |
1031 to DFAs exponentially increase, namely by $2^n$ (which is the number |
1041 in turn means that in some cases the number of states can by going |
1032 of subsets you can form for $n$ states). This blow up the number of |
1042 from NFAs to DFAs exponentially increase, namely by $2^n$ (which is |
1033 states in the DFA is again bad news for how quickly you can decide |
1043 the number of subsets you can form for sets of $n$ states). This blow |
1034 whether a string is accepted by a DFA or not. So the caveat with DFAs |
1044 up in the number of states in the DFA is again bad news for how |
1035 is that they might make the task of finding the next state trival, but |
1045 quickly you can decide whether a string is accepted by a DFA or |
1036 might require $2^n$ times as many states as a NFA.\bigskip |
1046 not. So the caveat with DFAs is that they might make the task of |
1037 |
1047 finding the next state trival, but might require $2^n$ times as many |
1038 Lastly, can we |
1048 states then a NFA.\bigskip |
|
1049 |
|
1050 \noindent |
|
1051 To conclude this section, how conveniently we can |
|
1052 implement the subset construction with our versions of NFAs and |
|
1053 DFAs? Very conveninetly. The code is just: |
1039 |
1054 |
1040 {\small\begin{lstlisting}[language=Scala] |
1055 {\small\begin{lstlisting}[language=Scala] |
1041 def subset[A, C](nfa: NFA[A, C]) : DFA[Set[A], C] = { |
1056 def subset[A, C](nfa: NFA[A, C]) : DFA[Set[A], C] = { |
1042 DFA(nfa.starts, |
1057 DFA(nfa.starts, |
1043 { case (qs, c) => nfa.nexts(qs, c) }, |
1058 { case (qs, c) => nfa.nexts(qs, c) }, |
1044 _.exists(nfa.fins)) |
1059 _.exists(nfa.fins)) |
1045 } |
1060 } |
1046 \end{lstlisting}} |
1061 \end{lstlisting}} |
1047 |
1062 |
1048 |
1063 \noindent |
|
1064 The interesting point in this code is that the state type of the |
|
1065 calculated DFA is \texttt{Set[A]}. Think carefully that this works out |
|
1066 correctly. |
|
1067 |
|
1068 The DFA is then given by three components: the starting states, the |
|
1069 transition function and the accepting-states function. The starting |
|
1070 states are a set in the given NFA, but a single state in the DFA. The |
|
1071 transition function, given the state \texttt{qs} and input \texttt{c}, |
|
1072 needs to produce the next state: this is the set of all NFA states |
|
1073 that are reachable from each state in \texttt{qs}. The function |
|
1074 \texttt{nexts} from the NFA class already calculates this for us. The |
|
1075 accepting-states function for the DFA is true henevner at least one |
|
1076 state in the subset is accepting (that is true) in the NFA.\medskip |
|
1077 |
|
1078 \noindent |
|
1079 You might be able to spend some quality tinkering with this code and |
|
1080 time to ponder. Then you will probably notice it is actually |
|
1081 silly. The whole point of translating the NFA into a DFA via the |
|
1082 subset construction is to make the decision of whether a string is |
|
1083 accepted or not faster. Given the code above, the generated DFA will |
|
1084 be exactly as fast, or as slow, as the NFA we started with (actually |
|
1085 it will even be a tiny bit slower). The reason is that we just re-use |
|
1086 the \texttt{nexts} function from the NFA. This fucntion implements the |
|
1087 non-deterministic breadth-first search. You might be thinking: That |
|
1088 is cheating! \ldots{} Well, not quite as you will see later, but in |
|
1089 terms of speed we still need to work a bit in order to get |
|
1090 sometimes(!) a faster DFA. Let's do this next. |
1049 |
1091 |
1050 \subsection*{DFA Minimisation} |
1092 \subsection*{DFA Minimisation} |
1051 |
1093 |
1052 As seen in the subset construction, the translation |
1094 As seen in \eqref{subsetdfa}, the subset construction from NFA to a |
1053 of a NFA to a DFA can result in a rather ``inefficient'' |
1095 DFA can result in a rather ``inefficient'' DFA. Meaning there are |
1054 DFA. Meaning there are states that are not needed. A |
1096 states that are not needed. There are two kinds of such unneeded |
1055 DFA can be \emph{minimised} by the following algorithm: |
1097 states: \emph{unreachable} states and \emph{nondistinguishable} |
|
1098 states. The first kind of states can just be removed without affecting |
|
1099 the language that can be recognised (after all they are |
|
1100 unreachable). The second kind can also be recognised and thus a DFA |
|
1101 can be \emph{minimised} by the following algorithm: |
1056 |
1102 |
1057 \begin{enumerate} |
1103 \begin{enumerate} |
1058 \item Take all pairs $(q, p)$ with $q \not= p$ |
1104 \item Take all pairs $(q, p)$ with $q \not= p$ |
1059 \item Mark all pairs that accepting and non-accepting states |
1105 \item Mark all pairs that accepting and non-accepting states |
1060 \item For all unmarked pairs $(q, p)$ and all characters $c$ |
1106 \item For all unmarked pairs $(q, p)$ and all characters $c$ |
1195 \end{center} |
1242 \end{center} |
1196 |
1243 |
1197 |
1244 |
1198 \subsection*{Brzozowski's Method} |
1245 \subsection*{Brzozowski's Method} |
1199 |
1246 |
1200 As said before, we can also go into the other direction---from |
1247 I know tyhis is already a long, long rant: but after all it is a topic |
1201 DFAs to regular expressions. Brzozowski's method calculates |
1248 that has been researched for more than 60 years. If you reflect on |
1202 a regular expression using familiar transformations for |
1249 what you have read so far, the story you can take a regular |
1203 solving equational systems. Consider the DFA: |
1250 expression, translate it via the Thompson Construction into an |
|
1251 $\epsilon$NFA, then translate it into a NFA by removing all |
|
1252 $\epsilon$-transitions, and then via the subset construction obtain a |
|
1253 DFA. In all steps we made sure the language, or which strings can be |
|
1254 recognised, stays the same. After the last section, we can even |
|
1255 minimise the DFA. But again we made sure the same language is |
|
1256 recognised. You might be wondering: Can we go into the other |
|
1257 direction? Can we go from a DFA and obtain a regular expression that |
|
1258 can recognise the same language as the DFA?\medskip |
|
1259 |
|
1260 \noindent |
|
1261 The answer is yes. Again there are several methods for calculating a |
|
1262 regular expression for a DFA. I will show you Brzozowski's method |
|
1263 because it calculates a regular expression using quite familiar |
|
1264 transformations for solving equational systems. Consider the DFA: |
1204 |
1265 |
1205 \begin{center} |
1266 \begin{center} |
1206 \begin{tikzpicture}[scale=1.5,>=stealth',very thick,auto, |
1267 \begin{tikzpicture}[scale=1.5,>=stealth',very thick,auto, |
1207 every state/.style={minimum size=0pt, |
1268 every state/.style={minimum size=0pt, |
1208 inner sep=2pt,draw=blue!50,very thick, |
1269 inner sep=2pt,draw=blue!50,very thick, |
1327 \end{eqnarray} |
1388 \end{eqnarray} |
1328 |
1389 |
1329 \noindent Finally, we only need to ``add'' up the equations |
1390 \noindent Finally, we only need to ``add'' up the equations |
1330 which correspond to a terminal state. In our running example, |
1391 which correspond to a terminal state. In our running example, |
1331 this is just $Q_2$. Consequently, a regular expression |
1392 this is just $Q_2$. Consequently, a regular expression |
1332 that recognises the same language as the automaton is |
1393 that recognises the same language as the DFA is |
1333 |
1394 |
1334 \[ |
1395 \[ |
1335 (b + a\,b + a\,a\,(a^*)\,b)^*\,a\,a\,(a)^* |
1396 (b + a\,b + a\,a\,(a^*)\,b)^*\,a\,a\,(a)^* |
1336 \] |
1397 \] |
1337 |
1398 |
1338 \noindent You can somewhat crosscheck your solution |
1399 \noindent You can somewhat crosscheck your solution by taking a string |
1339 by taking a string the regular expression can match and |
1400 the regular expression can match and and see whether it can be matched |
1340 and see whether it can be matched by the automaton. |
1401 by the DFA. One string for example is $aaa$ and \emph{voila} this |
1341 One string for example is $aaa$ and \emph{voila} this |
|
1342 string is also matched by the automaton. |
1402 string is also matched by the automaton. |
1343 |
1403 |
1344 We should prove that Brzozowski's method really produces |
1404 We should prove that Brzozowski's method really produces an equivalent |
1345 an equivalent regular expression for the automaton. But |
1405 regular expression. But for the purposes of this module, we omit |
1346 for the purposes of this module, we omit this. |
1406 this. I guess you are relieved. |
1347 |
1407 |
1348 |
1408 |
1349 \subsection*{Regular Languages} |
1409 \subsection*{Regular Languages} |
1350 |
1410 |
1351 Given the constructions in the previous sections we obtain |
1411 Given the constructions in the previous sections we obtain |
1383 |
1443 |
1384 \begin{quote} A language is \emph{regular} iff there exists an |
1444 \begin{quote} A language is \emph{regular} iff there exists an |
1385 automaton that recognises all its strings. |
1445 automaton that recognises all its strings. |
1386 \end{quote} |
1446 \end{quote} |
1387 |
1447 |
1388 \noindent So for deciding whether a string is recognised by a |
1448 \noindent Note that this is not a stement for a particular language |
1389 regular expression, we could use our algorithm based on |
1449 (that is a particular set of strings), but about a large class of |
1390 derivatives or NFAs or DFAs. But let us quickly look at what |
1450 languages, namely the regular ones. |
1391 the differences mean in computational terms. Translating a |
1451 |
1392 regular expression into a NFA gives us an automaton that has |
1452 As a consequence for deciding whether a string is recognised by a |
1393 $O(n)$ states---that means the size of the NFA grows linearly |
1453 regular expression, we could use our algorithm based on derivatives or |
1394 with the size of the regular expression. The problem with NFAs |
1454 NFAs or DFAs. But let us quickly look at what the differences mean in |
1395 is that the problem of deciding whether a string is accepted |
1455 computational terms. Translating a regular expression into a NFA gives |
1396 or not is computationally not cheap. Remember with NFAs we |
1456 us an automaton that has $O(n)$ states---that means the size of the |
1397 have potentially many next states even for the same input and |
1457 NFA grows linearly with the size of the regular expression. The |
1398 also have the silent $\epsilon$-transitions. If we want to |
1458 problem with NFAs is that the problem of deciding whether a string is |
1399 find a path from the starting state of a NFA to an accepting |
1459 accepted or not is computationally not cheap. Remember with NFAs we |
1400 state, we need to consider all possibilities. In Ruby and |
1460 have potentially many next states even for the same input and also |
1401 Python this is done by a depth-first search, which in turn |
1461 have the silent $\epsilon$-transitions. If we want to find a path from |
1402 means that if a ``wrong'' choice is made, the algorithm has to |
1462 the starting state of a NFA to an accepting state, we need to consider |
1403 backtrack and thus explore all potential candidates. This is |
1463 all possibilities. In Ruby, Python and Java this is done by a |
1404 exactly the reason why Ruby and Python are so slow for evil |
1464 depth-first search, which in turn means that if a ``wrong'' choice is |
1405 regular expressions. An alternative to the potentially slow |
1465 made, the algorithm has to backtrack and thus explore all potential |
1406 depth-first search is to explore the search space in a |
1466 candidates. This is exactly the reason why Ruby, Python and Java are |
1407 breadth-first fashion, but this might incur a big memory |
1467 so slow for evil regular expressions. An alternative to the |
1408 penalty. |
1468 potentially slow depth-first search is to explore the search space in |
1409 |
1469 a breadth-first fashion, but this might incur a big memory penalty. |
1410 To avoid the problems with NFAs, we can translate them |
1470 |
1411 into DFAs. With DFAs the problem of deciding whether a |
1471 To avoid the problems with NFAs, we can translate them into DFAs. With |
1412 string is recognised or not is much simpler, because in |
1472 DFAs the problem of deciding whether a string is recognised or not is |
1413 each state it is completely determined what the next |
1473 much simpler, because in each state it is completely determined what |
1414 state will be for a given input. So no search is needed. |
1474 the next state will be for a given input. So no search is needed. The |
1415 The problem with this is that the translation to DFAs |
1475 problem with this is that the translation to DFAs can explode |
1416 can explode exponentially the number of states. Therefore when |
1476 exponentially the number of states. Therefore when this route is |
1417 this route is taken, we definitely need to minimise the |
1477 taken, we definitely need to minimise the resulting DFAs in order to |
1418 resulting DFAs in order to have an acceptable memory |
1478 have an acceptable memory and runtime behaviour. But remember the |
1419 and runtime behaviour. But remember the subset construction |
1479 subset construction in the worst case explodes the number of states by |
1420 in the worst case explodes the number of states by $2^n$. |
1480 $2^n$. Effectively also the translation to DFAs can incur a big |
1421 Effectively also the translation to DFAs can incur a big |
|
1422 runtime penalty. |
1481 runtime penalty. |
1423 |
1482 |
1424 But this does not mean that everything is bad with automata. |
1483 But this does not mean that everything is bad with automata. Recall |
1425 Recall the problem of finding a regular expressions for the |
1484 the problem of finding a regular expressions for the language that is |
1426 language that is \emph{not} recognised by a regular |
1485 \emph{not} recognised by a regular expression. In our implementation |
1427 expression. In our implementation we added explicitly such a |
1486 we added explicitly such a regular expressions because they are useful |
1428 regular expressions because they are useful for recognising |
1487 for recognising comments. But in principle we did not need to. The |
1429 comments. But in principle we did not need to. The argument |
1488 argument for this is as follows: take a regular expression, translate |
1430 for this is as follows: take a regular expression, translate |
|
1431 it into a NFA and then a DFA that both recognise the same |
1489 it into a NFA and then a DFA that both recognise the same |
1432 language. Once you have the DFA it is very easy to construct |
1490 language. Once you have the DFA it is very easy to construct the |
1433 the automaton for the language not recognised by a DFA. If |
1491 automaton for the language not recognised by a DFA. If the DFA is |
1434 the DFA is completed (this is important!), then you just need |
1492 completed (this is important!), then you just need to exchange the |
1435 to exchange the accepting and non-accepting states. You can |
1493 accepting and non-accepting states. You can then translate this DFA |
1436 then translate this DFA back into a regular expression and |
1494 back into a regular expression and that will be the regular expression |
1437 that will be the regular expression that can match all strings |
1495 that can match all strings the original regular expression could |
1438 the original regular expression could \emph{not} match. |
1496 \emph{not} match. |
1439 |
1497 |
1440 It is also interesting that not all languages are regular. The |
1498 It is also interesting that not all languages are regular. The most |
1441 most well-known example of a language that is not regular |
1499 well-known example of a language that is not regular consists of all |
1442 consists of all the strings of the form |
1500 the strings of the form |
1443 |
1501 |
1444 \[a^n\,b^n\] |
1502 \[a^n\,b^n\] |
1445 |
1503 |
1446 \noindent meaning strings that have the same number of $a$s |
1504 \noindent meaning strings that have the same number of $a$s and |
1447 and $b$s. You can try, but you cannot find a regular |
1505 $b$s. You can try, but you cannot find a regular expression for this |
1448 expression for this language and also not an automaton. One |
1506 language and also not an automaton. One can actually prove that there |
1449 can actually prove that there is no regular expression nor |
1507 is no regular expression nor automaton for this language, but again |
1450 automaton for this language, but again that would lead us too |
1508 that would lead us too far afield for what we want to do in this |
1451 far afield for what we want to do in this module. |
1509 module. |
1452 |
1510 |
1453 %\section*{Further Reading} |
1511 %\section*{Further Reading} |
1454 |
1512 |
1455 %Compare what a ``human expert'' would create as an automaton for the |
1513 %Compare what a ``human expert'' would create as an automaton for the |
1456 %regular expression $a\cdot (b + c)^*$ and what the Thomson |
1514 %regular expression $a\cdot (b + c)^*$ and what the Thomson |