597 |
597 |
598 \noindent Unfortunately we cannot make any more progress with |
598 \noindent Unfortunately we cannot make any more progress with |
599 substituting equations, because both (6) and (7) contain the |
599 substituting equations, because both (6) and (7) contain the |
600 variable on the left-hand side also on the right-hand side. |
600 variable on the left-hand side also on the right-hand side. |
601 Here we need to now use a law that is different from the usual |
601 Here we need to now use a law that is different from the usual |
602 laws. It is called \emph{Arden's rule}. It states that |
602 laws about linear equations. It is called \emph{Arden's rule}. |
603 if an equation is of the form $q = q\,r + s$ then it can be |
603 It states that if an equation is of the form $q = q\,r + s$ |
604 transformed to $q = s\, r^*$. Since we can assume $+$ is |
604 then it can be transformed to $q = s\, r^*$. Since we can |
605 symmetric, equation (7) is of that form: $s$ is $q_0\,a\,a$ |
605 assume $+$ is symmetric, Equation (7) is of that form: $s$ is |
606 and $r$ is $a$. That means we can transform Equation (7) |
606 $q_0\,a\,a$ and $r$ is $a$. That means we can transform |
607 to obtain the two new equations |
607 (7) to obtain the two new equations |
608 |
608 |
609 \begin{eqnarray} |
609 \begin{eqnarray} |
610 q_0 & = & \epsilon + q_0\,(b + a\,b) + q_2\,b\\ |
610 q_0 & = & \epsilon + q_0\,(b + a\,b) + q_2\,b\\ |
611 q_2 & = & q_0\,a\,a\,(a^*) |
611 q_2 & = & q_0\,a\,a\,(a^*) |
612 \end{eqnarray} |
612 \end{eqnarray} |
631 |
631 |
632 \begin{eqnarray} |
632 \begin{eqnarray} |
633 q_0 & = & \epsilon\,(b + a\,b + a\,a\,(a^*)\,b)^* |
633 q_0 & = & \epsilon\,(b + a\,b + a\,a\,(a^*)\,b)^* |
634 \end{eqnarray} |
634 \end{eqnarray} |
635 |
635 |
636 \noindent SInce this is a regular expression, we can simplify |
636 \noindent Since this is a regular expression, we can simplify |
637 away the $\epsilon$ to obtain the slightly simpler regular |
637 away the $\epsilon$ to obtain the slightly simpler regular |
638 expression |
638 expression |
639 |
639 |
640 \begin{eqnarray} |
640 \begin{eqnarray} |
641 q_0 & = & (b + a\,b + a\,a\,(a^*)\,b)^* |
641 q_0 & = & (b + a\,b + a\,a\,(a^*)\,b)^* |
835 \end{tikzpicture} |
835 \end{tikzpicture} |
836 \end{center} |
836 \end{center} |
837 |
837 |
838 \noindent By going from regular expressions over NFAs to DFAs, |
838 \noindent By going from regular expressions over NFAs to DFAs, |
839 we can always ensure that for every regular expression there |
839 we can always ensure that for every regular expression there |
840 exists a NFA and DFA that can recognise the same language. |
840 exists a NFA and a DFA that can recognise the same language. |
841 Although we did not prove this fact. Similarly by going from |
841 Although we did not prove this fact. Similarly by going from |
842 DFAs to regular expressions, we can make sure for every DFA |
842 DFAs to regular expressions, we can make sure for every DFA |
843 there exists a regular expression that can recognise the same |
843 there exists a regular expression that can recognise the same |
844 language. Again we did not prove this fact. |
844 language. Again we did not prove this fact. |
845 |
845 |
862 the differences mean in computational terms. Translating a |
862 the differences mean in computational terms. Translating a |
863 regular expression into a NFA gives us an automaton that has |
863 regular expression into a NFA gives us an automaton that has |
864 $O(n)$ nodes---that means the size of the NFA grows linearly |
864 $O(n)$ nodes---that means the size of the NFA grows linearly |
865 with the size of the regular expression. The problem with NFAs |
865 with the size of the regular expression. The problem with NFAs |
866 is that the problem of deciding whether a string is accepted |
866 is that the problem of deciding whether a string is accepted |
867 or not is computationally not cheap. Remember with NFAs we have |
867 or not is computationally not cheap. Remember with NFAs we |
868 potentially many next states even for the same input and also |
868 have potentially many next states even for the same input and |
869 have the silent $\epsilon$-transitions. If we want to find a |
869 also have the silent $\epsilon$-transitions. If we want to |
870 path from the starting state of an NFA to an accepting state, |
870 find a path from the starting state of an NFA to an accepting |
871 we need to consider all possibilities. In Ruby and Python this |
871 state, we need to consider all possibilities. In Ruby and |
872 is done by a depth-first search, which in turn means that if a |
872 Python this is done by a depth-first search, which in turn |
873 ``wrong'' choice is made, the algorithm has to backtrack and |
873 means that if a ``wrong'' choice is made, the algorithm has to |
874 thus explore all potential candidates. This is exactly the |
874 backtrack and thus explore all potential candidates. This is |
875 reason why Ruby and Python are so slow for evil regular |
875 exactly the reason why Ruby and Python are so slow for evil |
876 expressions. The alternative is to explore the search space |
876 regular expressions. An alternative to the potentially slow |
877 in a breadth-first fashion, but this might incur a big memory |
877 depth-first search is to explore the search space in a |
|
878 breadth-first fashion, but this might incur a big memory |
878 penalty. |
879 penalty. |
879 |
880 |
880 To avoid the problems with NFAs, we can translate them |
881 To avoid the problems with NFAs, we can translate them |
881 into DFAs. With DFAs the problem of deciding whether a |
882 into DFAs. With DFAs the problem of deciding whether a |
882 string is recognised or not is much simpler, because in |
883 string is recognised or not is much simpler, because in |
886 can explode exponentially the number of states. Therefore when |
887 can explode exponentially the number of states. Therefore when |
887 this route is taken, we definitely need to minimise the |
888 this route is taken, we definitely need to minimise the |
888 resulting DFAs in order to have an acceptable memory |
889 resulting DFAs in order to have an acceptable memory |
889 and runtime behaviour. But remember the subset construction |
890 and runtime behaviour. But remember the subset construction |
890 in the worst case explodes the number of states by $2^n$. |
891 in the worst case explodes the number of states by $2^n$. |
|
892 Effectively also the translation to DFAs can incur a big |
|
893 runtime penalty. |
891 |
894 |
892 But this does not mean that everything is bad with automata. |
895 But this does not mean that everything is bad with automata. |
893 Recall the problem of finding a regular expressions for the |
896 Recall the problem of finding a regular expressions for the |
894 language that is \emph{not} recognised by a regular |
897 language that is \emph{not} recognised by a regular |
895 expression. In our implementation we added explicitly such a |
898 expression. In our implementation we added explicitly such a |
896 regular expressions because they are useful for recognising |
899 regular expressions because they are useful for recognising |
897 comments. But in principle we did not need to. The argument |
900 comments. But in principle we did not need to. The argument |
898 for this is as follows: take a regular expression, translate |
901 for this is as follows: take a regular expression, translate |
899 it into a NFA and DFA that recognise the same language. Once |
902 it into a NFA and then a DFA that both recognise the same |
900 you have the DFA it is very easy to construct the automaton |
903 language. Once you have the DFA it is very easy to construct |
901 for the language not recognised by an DFA. If the DFA is |
904 the automaton for the language not recognised by an DFA. If |
902 completed (this is important!), then you just need to exchange |
905 the DFA is completed (this is important!), then you just need |
903 the accepting and non-accepting states. You can then translate |
906 to exchange the accepting and non-accepting states. You can |
904 this DFA back into a regular expression. |
907 then translate this DFA back into a regular expression and |
905 |
908 that will be the regular expression that can match all strings |
906 Not all languages are regular. The most well-known example |
909 the original regular expression could \emph{not} match. |
907 of a language that is not regular consists of all the strings |
910 |
908 of the form |
911 It is also interesting that not all languages are regular. The |
|
912 most well-known example of a language that is not regular |
|
913 consists of all the strings of the form |
909 |
914 |
910 \[a^n\,b^n\] |
915 \[a^n\,b^n\] |
911 |
916 |
912 \noindent meaning strings that have the same number of $a$s |
917 \noindent meaning strings that have the same number of $a$s |
913 and $b$s. You can try, but you cannot find a regular |
918 and $b$s. You can try, but you cannot find a regular |