handouts/ho03.tex
changeset 344 408fd5994288
parent 333 8890852e18b7
child 349 434891622131
equal deleted inserted replaced
343:539b2e88f5b9 344:408fd5994288
   375 \subsubsection*{Subset Construction}
   375 \subsubsection*{Subset Construction}
   376 
   376 
   377 What is interesting is that for every NFA we can find a DFA
   377 What is interesting is that for every NFA we can find a DFA
   378 which recognises the same language. This can, for example, be
   378 which recognises the same language. This can, for example, be
   379 done by the \emph{subset construction}. Consider again the NFA
   379 done by the \emph{subset construction}. Consider again the NFA
   380 on the left, consisting of nodes labeled $0$, $1$ and $2$. 
   380 below on the left, consisting of nodes labeled $0$, $1$ and $2$. 
   381 
   381 
   382 \begin{center}
   382 \begin{center}
   383 \begin{tabular}{c@{\hspace{10mm}}c}
   383 \begin{tabular}{c@{\hspace{10mm}}c}
   384 \begin{tikzpicture}[scale=0.7,>=stealth',very thick,
   384 \begin{tikzpicture}[scale=0.7,>=stealth',very thick,
   385                     every state/.style={minimum size=0pt,
   385                     every state/.style={minimum size=0pt,
   396 \end{tikzpicture}
   396 \end{tikzpicture}
   397 &
   397 &
   398 \begin{tabular}{r|cl}
   398 \begin{tabular}{r|cl}
   399 nodes & $a$ & $b$\\
   399 nodes & $a$ & $b$\\
   400 \hline
   400 \hline
   401 $\varnothing\phantom{\star}$ & $\varnothing$ & $\varnothing$\\
   401 $\{\}\phantom{\star}$ & $\{\}$ & $\{\}$\\
   402 $\{0\}\phantom{\star}$       & $\{0,1,2\}$   & $\{2\}$\\
   402 $\{0\}\phantom{\star}$       & $\{0,1,2\}$   & $\{2\}$\\
   403 $\{1\}\phantom{\star}$       & $\{1\}$       & $\varnothing$\\
   403 $\{1\}\phantom{\star}$       & $\{1\}$       & $\{\}$\\
   404 $\{2\}\star$  & $\varnothing$ & $\{2\}$\\
   404 $\{2\}\star$  & $\{\}$ & $\{2\}$\\
   405 $\{0,1\}\phantom{\star}$     & $\{0,1,2\}$   & $\{2\}$\\
   405 $\{0,1\}\phantom{\star}$     & $\{0,1,2\}$   & $\{2\}$\\
   406 $\{0,2\}\star$ & $\{0,1,2\}$   & $\{2\}$\\
   406 $\{0,2\}\star$ & $\{0,1,2\}$   & $\{2\}$\\
   407 $\{1,2\}\star$ & $\{1\}$       & $\{2\}$\\
   407 $\{1,2\}\star$ & $\{1\}$       & $\{2\}$\\
   408 s: $\{0,1,2\}\star$ & $\{0,1,2\}$ & $\{2\}$\\
   408 s: $\{0,1,2\}\star$ & $\{0,1,2\}$ & $\{2\}$\\
   409 \end{tabular}
   409 \end{tabular}
   411 \end{center}
   411 \end{center}
   412 
   412 
   413 \noindent The nodes of the DFA are given by calculating all
   413 \noindent The nodes of the DFA are given by calculating all
   414 subsets of the set of nodes of the NFA (seen in the nodes
   414 subsets of the set of nodes of the NFA (seen in the nodes
   415 column on the right). The table shows the transition function
   415 column on the right). The table shows the transition function
   416 for the DFA. The first row states that $\varnothing$ is the
   416 for the DFA. The first row states that $\{\}$ is the
   417 sink node which has transitions for $a$ and $b$ to itself.
   417 sink node which has transitions for $a$ and $b$ to itself.
   418 The next three lines are calculated as follows: 
   418 The next three lines are calculated as follows: 
   419 
   419 
   420 \begin{itemize}
   420 \begin{itemize}
   421 \item suppose you calculate the entry for the transition for
   421 \item suppose you calculate the entry for the transition for
   425       set of nodes, in this case $\{0,1,2\}$
   425       set of nodes, in this case $\{0,1,2\}$
   426 \item filter out all notes that do not allow an $a$-transition
   426 \item filter out all notes that do not allow an $a$-transition
   427       from this set, this excludes $2$ which does not permit a
   427       from this set, this excludes $2$ which does not permit a
   428       $a$-transition
   428       $a$-transition
   429 \item from the remaining set, do as many $\epsilon$-transition
   429 \item from the remaining set, do as many $\epsilon$-transition
   430       as you can, this yields $\{0,1,2\}$     
   430       as you can, this yields again $\{0,1,2\}$     
   431 \item the resulting set specifies the transition from $\{0\}$
   431 \item the resulting set specifies the transition from $\{0\}$
   432       when given an $a$ 
   432       when given an $a$ 
   433 \end{itemize}
   433 \end{itemize}
   434 
   434 
   435 \noindent Similarly for the other entries in the rows for
   435 \noindent So the transition from the state $\{0\}$ reading an
   436 $\{0\}$, $\{1\}$ and $\{2\}$. The other rows are calculated by
   436 $a$ goes to the state $\{0,1,2\}$. Similarly for the other
   437 just taking the union of the single node entries. For example
   437 entries in the rows for $\{0\}$, $\{1\}$ and $\{2\}$. The
   438 for $a$ and $\{0,1\}$ we need to take the union of $\{0,1,2\}$
   438 other rows are calculated by just taking the union of the
   439 (for $0$) and $\{1\}$ (for $1$). The starting state of the DFA
   439 single node entries. For example for $a$ and $\{0,1\}$ we need
   440 can be calculated from the starting state of the NFA, that is
   440 to take the union of $\{0,1,2\}$ (for $0$) and $\{1\}$ (for
   441 $0$, and then do as many $\epsilon$-transitions as possible.
   441 $1$). The starting state of the DFA can be calculated from the
   442 This gives $\{0,1,2\}$ which is the starting state of the DFA.
   442 starting state of the NFA, that is $0$, and then do as many
   443 The terminal states in the DFA are given by all sets that
   443 $\epsilon$-transitions as possible. This gives $\{0,1,2\}$
   444 contain a $2$, which is the terminal state of the NFA. This
   444 which is the starting state of the DFA. The terminal states in
   445 completes the subset construction.
   445 the DFA are given by all sets that contain a $2$, which is the
       
   446 terminal state of the NFA. This completes the subset
       
   447 construction. So the corresponding DFA to the NFA from 
       
   448 above is
       
   449 
       
   450 \begin{center}
       
   451 \begin{tikzpicture}[scale=0.7,>=stealth',very thick,
       
   452                     every state/.style={minimum size=0pt,
       
   453                     draw=blue!50,very thick,fill=blue!20},
       
   454                     baseline=0mm]
       
   455 \node[state,initial,accepting]  (q012)  {$0,1,2$};
       
   456 \node[state,accepting] (q02) [right=of q012] {$0,2$};
       
   457 \node[state] (q01) [above=of q02] {$0,1$};
       
   458 \node[state,accepting] (q12) [below=of q02] {$1,2$};
       
   459 \node[state] (q0) [right=2cm of q01] {$0$};
       
   460 \node[state] (q1) [right=2.5cm of q02] {$1$};
       
   461 \node[state,accepting] (q2) [right=1.5cm of q12] {$2$};
       
   462 \node[state] (qn) [right=of q1] {$\{\}$};
       
   463 
       
   464 \path[->] (q012) edge [loop below] node  {$a$} ();
       
   465 \path[->] (q012) edge node [above]  {$b$} (q2);
       
   466 \path[->] (q12) edge [bend left] node [below,pos=0.4]  {$a$} (q1);
       
   467 \path[->] (q12) edge node [below]  {$b$} (q2);
       
   468 \path[->] (q02) edge node [above]  {$a$} (q012);
       
   469 \path[->] (q02) edge [bend left] node [above, pos=0.8]  {$b$} (q2);
       
   470 \path[->] (q01) edge node [below]  {$a$} (q012);
       
   471 \path[->] (q01) edge [bend left] node [above]  {$b$} (q2);
       
   472 \path[->] (q0) edge node [below]  {$a$} (q012);
       
   473 \path[->] (q0) edge node [right, pos=0.2]  {$b$} (q2);
       
   474 \path[->] (q1) edge [loop above] node  {$a$} ();
       
   475 \path[->] (q1) edge node [above]  {$b$} (qn);
       
   476 \path[->] (q2) edge [loop right] node  {$b$} ();
       
   477 \path[->] (q2) edge node [below]  {$a$} (qn);
       
   478 \path[->] (qn) edge [loop above] node  {$a,b$} ();
       
   479 \end{tikzpicture}
       
   480 \end{center}
       
   481 
       
   482 
   446 
   483 
   447 There are two points to note: One is that very often the
   484 There are two points to note: One is that very often the
   448 resulting DFA contains a number of ``dead'' nodes that are
   485 resulting DFA contains a number of ``dead'' nodes that are
   449 never reachable from the starting state (that is that the
   486 never reachable from the starting state. For example
   450 calculated DFA in this example is not a minimal DFA). Such
   487 there is no way to reach node $\{0,2\}$ from the starting
       
   488 state $\{0,1,2\}$. I let you find the other dead states.
       
   489 In effect the DFA in this example is not a minimal DFA. Such
   451 dead nodes can be safely removed without changing the language
   490 dead nodes can be safely removed without changing the language
   452 that is recognised by the DFA. Another point is that in some
   491 that is recognised by the DFA. Another point is that in some
   453 cases, however, the subset construction produces a DFA that
   492 cases, however, the subset construction produces a DFA that
   454 does \emph{not} contain any dead nodes\ldots{}that means it
   493 does \emph{not} contain any dead nodes\ldots{}that means it
   455 calculates a minimal DFA. Which in turn means that in some
   494 calculates a minimal DFA. Which in turn means that in some
   456 cases the number of nodes by going from NFAs to DFAs
   495 cases the number of nodes by going from NFAs to DFAs
   457 exponentially increases, namely by $2^n$ (which is the number
   496 exponentially increases, namely by $2^n$ (which is the number
   458 of subsets you can form for $n$ nodes). 
   497 of subsets you can form for $n$ nodes). 
   459 
   498 
       
   499 Removing all the dead states in the automaton above,
       
   500 gives a much more legible automaton, namely
       
   501 
       
   502 \begin{center}
       
   503 \begin{tikzpicture}[scale=0.7,>=stealth',very thick,
       
   504                     every state/.style={minimum size=0pt,
       
   505                     draw=blue!50,very thick,fill=blue!20},
       
   506                     baseline=0mm]
       
   507 \node[state,initial,accepting]  (q012)  {$0,1,2$};
       
   508 \node[state,accepting] (q2) [right=of q012] {$2$};
       
   509 \node[state] (qn) [right=of q2] {$\{\}$};
       
   510 
       
   511 \path[->] (q012) edge [loop below] node  {$a$} ();
       
   512 \path[->] (q012) edge node [above]  {$b$} (q2);
       
   513 \path[->] (q2) edge [loop below] node  {$b$} ();
       
   514 \path[->] (q2) edge node [below]  {$a$} (qn);
       
   515 \path[->] (qn) edge [loop above] node  {$a,b$} ();
       
   516 \end{tikzpicture}
       
   517 \end{center}
       
   518 
       
   519 \noindent Now the big question is whether this DFA
       
   520 can recognise the same language as the NFA we started with.
       
   521 I let you ponder about this question.
   460 
   522 
   461 \subsubsection*{Brzozowski's Method}
   523 \subsubsection*{Brzozowski's Method}
   462 
   524 
   463 As said before, we can also go into the other direction---from
   525 As said before, we can also go into the other direction---from
   464 DFAs to regular expressions. Brzozowski's method calculates
   526 DFAs to regular expressions. Brzozowski's method calculates
   767 \node (dfa) [right=of nfa] {\bf DFAs};
   829 \node (dfa) [right=of nfa] {\bf DFAs};
   768 \node (mdfa) [right=of dfa] {\bf\begin{tabular}{c}minimal\\ DFAs\end{tabular}};
   830 \node (mdfa) [right=of dfa] {\bf\begin{tabular}{c}minimal\\ DFAs\end{tabular}};
   769 \path[->,line width=1mm] (rexp) edge node [above=4mm, black] {\begin{tabular}{c@{\hspace{9mm}}}Thompson's\\[-1mm] construction\end{tabular}} (nfa);
   831 \path[->,line width=1mm] (rexp) edge node [above=4mm, black] {\begin{tabular}{c@{\hspace{9mm}}}Thompson's\\[-1mm] construction\end{tabular}} (nfa);
   770 \path[->,line width=1mm] (nfa) edge node [above=4mm, black] {\begin{tabular}{c}subset\\[-1mm] construction\end{tabular}}(dfa);
   832 \path[->,line width=1mm] (nfa) edge node [above=4mm, black] {\begin{tabular}{c}subset\\[-1mm] construction\end{tabular}}(dfa);
   771 \path[->,line width=1mm] (dfa) edge node [below=5mm, black] {minimisation} (mdfa);
   833 \path[->,line width=1mm] (dfa) edge node [below=5mm, black] {minimisation} (mdfa);
   772 \path[->,line width=1mm] (dfa) edge [bend left=45] (rexp);
   834 \path[->,line width=1mm] (dfa) edge [bend left=45] node [below] {\begin{tabular}{l}Brzozowski's\\ method\end{tabular}} (rexp);
   773 \end{tikzpicture}
   835 \end{tikzpicture}
   774 \end{center}
   836 \end{center}
   775 
   837 
   776 \noindent By going from regular expressions over NFAs to DFAs,
   838 \noindent By going from regular expressions over NFAs to DFAs,
   777 we can always ensure that for every regular expression there
   839 we can always ensure that for every regular expression there
   800 the differences mean in computational terms. Translating a
   862 the differences mean in computational terms. Translating a
   801 regular expression into a NFA gives us an automaton that has
   863 regular expression into a NFA gives us an automaton that has
   802 $O(n)$ nodes---that means the size of the NFA grows linearly
   864 $O(n)$ nodes---that means the size of the NFA grows linearly
   803 with the size of the regular expression. The problem with NFAs
   865 with the size of the regular expression. The problem with NFAs
   804 is that the problem of deciding whether a string is accepted
   866 is that the problem of deciding whether a string is accepted
   805 is computationally not cheap. Remember with NFAs we have
   867 or not is computationally not cheap. Remember with NFAs we have
   806 potentially many next states even for the same input and also
   868 potentially many next states even for the same input and also
   807 have the silent $\epsilon$-transitions. If we want to find a
   869 have the silent $\epsilon$-transitions. If we want to find a
   808 path from the starting state of an NFA to an accepting state,
   870 path from the starting state of an NFA to an accepting state,
   809 we need to consider all possibilities. In Ruby and Python this
   871 we need to consider all possibilities. In Ruby and Python this
   810 is done by a depth-first search, which in turn means that if a
   872 is done by a depth-first search, which in turn means that if a
   811 ``wrong'' choice is made, the algorithm has to backtrack and
   873 ``wrong'' choice is made, the algorithm has to backtrack and
   812 thus explore all potential candidates. This is exactly the
   874 thus explore all potential candidates. This is exactly the
   813 reason why Ruby and Python are so slow for evil regular
   875 reason why Ruby and Python are so slow for evil regular
   814 expressions. The alternative is to explore the search space
   876 expressions. The alternative is to explore the search space
   815 in a breadth-first fashion, but this might incur a memory
   877 in a breadth-first fashion, but this might incur a big memory
   816 penalty.  
   878 penalty.  
   817 
   879 
   818 To avoid the problems with NFAs, we can translate them 
   880 To avoid the problems with NFAs, we can translate them 
   819 into DFAs. With DFAs the problem of deciding whether a
   881 into DFAs. With DFAs the problem of deciding whether a
   820 string is recognised or not is much simpler, because in
   882 string is recognised or not is much simpler, because in
   822 state will be for a given input. So no search is needed.
   884 state will be for a given input. So no search is needed.
   823 The problem with this is that the translation to DFAs
   885 The problem with this is that the translation to DFAs
   824 can explode exponentially the number of states. Therefore when 
   886 can explode exponentially the number of states. Therefore when 
   825 this route is taken, we definitely need to minimise the
   887 this route is taken, we definitely need to minimise the
   826 resulting DFAs in order to have an acceptable memory 
   888 resulting DFAs in order to have an acceptable memory 
   827 and runtime behaviour.
   889 and runtime behaviour. But remember the subset construction
       
   890 in the worst case explodes the number of states by $2^n$.
   828 
   891 
   829 But this does not mean that everything is bad with automata.
   892 But this does not mean that everything is bad with automata.
   830 Recall the problem of finding a regular expressions for the
   893 Recall the problem of finding a regular expressions for the
   831 language that is \emph{not} recognised by a regular
   894 language that is \emph{not} recognised by a regular
   832 expression. In our implementation we added explicitly such a
   895 expression. In our implementation we added explicitly such a
   851 expression for this language and also not an automaton. One
   914 expression for this language and also not an automaton. One
   852 can actually prove that there is no regular expression nor
   915 can actually prove that there is no regular expression nor
   853 automaton for this language, but again that would lead us too
   916 automaton for this language, but again that would lead us too
   854 far afield for what we want to do in this module.
   917 far afield for what we want to do in this module.
   855 
   918 
   856 \section{Further Reading}
   919 \section*{Further Reading}
   857 
   920 
   858 Compare what a ``human expert'' would create as automaton for the
   921 Compare what a ``human expert'' would create as automaton for the
   859 regular expression $a (b + c)^*$ and what the Thomson
   922 regular expression $a (b + c)^*$ and what the Thomson
   860 algorithm generates.
   923 algorithm generates.
   861 
   924