178 ``silently'' change to a different state. In the left picture, |
178 ``silently'' change to a different state. In the left picture, |
179 for example, if you are in the starting state, you can |
179 for example, if you are in the starting state, you can |
180 silently move either to $q_1$ or $q_2$. |
180 silently move either to $q_1$ or $q_2$. |
181 |
181 |
182 |
182 |
183 \subsection*{Thompson Construction} |
183 \subsubsection*{Thompson Construction} |
184 |
184 |
185 The reason for introducing NFAs is that there is a relatively |
185 The reason for introducing NFAs is that there is a relatively |
186 simple (recursive) translation of regular expressions into |
186 simple (recursive) translation of regular expressions into |
187 NFAs. Consider the simple regular expressions $\varnothing$, |
187 NFAs. Consider the simple regular expressions $\varnothing$, |
188 $\epsilon$ and $c$. They can be translated as follows: |
188 $\epsilon$ and $c$. They can be translated as follows: |
456 of nodes by going from NFAs to DFAs exponentially increases, |
456 of nodes by going from NFAs to DFAs exponentially increases, |
457 namely by $2^n$ (which is the number of subsets you can form |
457 namely by $2^n$ (which is the number of subsets you can form |
458 for $n$ nodes). |
458 for $n$ nodes). |
459 |
459 |
460 |
460 |
461 \subsection*{Brzozowski's Method} |
461 \subsubsection*{Brzozowski's Method} |
462 |
462 |
463 \subsection*{Automata Minimization} |
463 DFA -> NFA |
464 |
464 |
465 \subsection*{Regular Languages and Automata} |
465 \subsubsection*{Automata Minimization} |
|
466 |
|
467 \subsubsection*{Regular Languages} |
|
468 |
|
469 Given the constructions in the previous sections we obtain |
|
470 the following picture: |
|
471 |
|
472 \begin{center} |
|
473 \begin{tikzpicture} |
|
474 \node (rexp) {\bf Regexps}; |
|
475 \node (nfa) [right=of rexp] {\bf NFAs}; |
|
476 \node (dfa) [right=of nfa] {\bf DFAs}; |
|
477 \node (mdfa) [right=of dfa] {\bf\begin{tabular}{c}minimal\\ DFAs\end{tabular}}; |
|
478 \path[->,line width=1mm] (rexp) edge node [above=4mm, black] {\begin{tabular}{c@{\hspace{9mm}}}Thompson's\\[-1mm] construction\end{tabular}} (nfa); |
|
479 \path[->,line width=1mm] (nfa) edge node [above=4mm, black] {\begin{tabular}{c}subset\\[-1mm] construction\end{tabular}}(dfa); |
|
480 \path[->,line width=1mm] (dfa) edge node [below=5mm, black] {minimisation} (mdfa); |
|
481 \path[->,line width=1mm] (dfa) edge [bend left=45] (rexp); |
|
482 \end{tikzpicture} |
|
483 \end{center} |
|
484 |
|
485 \noindent By going from regular expressions over NFAs to DFAs, |
|
486 we can always ensure that for every regular expression there |
|
487 exists a NFA and DFA that can recognise the same language. |
|
488 Although we did not prove this fact. Similarly by going from |
|
489 DFAs to regular expressions, we can make sure for every DFA |
|
490 there exists a regular expression that can recognise the same |
|
491 language. Again we did not prove this fact. |
|
492 |
|
493 The interesting conclusion is that automata and regular |
|
494 expressions can recognise the same set of languages: |
|
495 |
|
496 \begin{quote} A language is \emph{regular} iff there exists a |
|
497 regular expression that recognises all its strings. |
|
498 \end{quote} |
|
499 |
|
500 \noindent or equivalently |
|
501 |
|
502 \begin{quote} A language is \emph{regular} iff there exists an |
|
503 automaton that recognises all its strings. |
|
504 \end{quote} |
|
505 |
|
506 \noindent So for deciding whether a string is recognised by a |
|
507 regular expression, we could use our algorithm based on |
|
508 derivatives or NFAs or DFAs. But let us quickly look at what |
|
509 the differences mean in computational terms. Translating a |
|
510 regular expression into a NFA gives us an automaton that has |
|
511 $O(n)$ nodes---that means the size of the NFA grows linearly |
|
512 with the size of the regular expression. The problem with NFAs |
|
513 is that the problem of deciding whether a string is accepted |
|
514 is computationally not cheap. Remember with NFAs we have |
|
515 potentially many next states even for the same input and also |
|
516 have the silent $\epsilon$-transitions. If we want to find a |
|
517 path from the starting state of an NFA to an accepting state, |
|
518 we need to consider all possibilities. In Ruby and Python this |
|
519 is done by a depth-first search, which in turn means that if a |
|
520 ``wrong'' choice is made, the algorithm has to backtrack and |
|
521 thus explore all potential candidates. This is exactly the |
|
522 reason why Ruby and Python are so slow for evil regular |
|
523 expressions. The alternative is to explore the search space |
|
524 in a breadth-first fashion, but this might incur a memory |
|
525 penalty. |
|
526 |
|
527 To avoid the problems with NFAs, we can translate them |
|
528 into DFAs. With DFAs the problem of deciding whether a |
|
529 string is recognised or not is much simpler, because in |
|
530 each state it is completely determined what the next |
|
531 state will be for a given input. So no search is needed. |
|
532 The problem with this is that the translation to DFAs |
|
533 can explode exponentially the number of states. Therefore when |
|
534 this route is taken, we definitely need to minimise the |
|
535 resulting DFAs in order to have an acceptable memory |
|
536 and runtime behaviour. |
|
537 |
|
538 But this does not mean that everything is bad with automata. |
|
539 Recall the problem of finding a regular expressions for the |
|
540 language that is \emph{not} recognised by a regular |
|
541 expression. In our implementation we added explicitly such a |
|
542 regular expressions because they are useful for recognising |
|
543 comments. But in principle we did not need to. The argument |
|
544 for this is as follows: take a regular expression, translate |
|
545 it into a NFA and DFA that recognise the same language. Once |
|
546 you have the DFA it is very easy to construct the automaton |
|
547 for the language not recognised by an DFA. If the DAF is |
|
548 completed (this is important!), then you just need to exchange |
|
549 the accepting and non-accepting states. You can then translate |
|
550 this DFA back into a regular expression. |
466 |
551 |
467 \end{document} |
552 \end{document} |
468 |
553 |
469 %%% Local Variables: |
554 %%% Local Variables: |
470 %%% mode: latex |
555 %%% mode: latex |