45 one state to the next state with respect to a character. We have the |
45 one state to the next state with respect to a character. We have the |
46 assumption that these transition functions do not need to be defined |
46 assumption that these transition functions do not need to be defined |
47 everywhere: so it can be the case that given a character there is no |
47 everywhere: so it can be the case that given a character there is no |
48 next state, in which case we need to raise a kind of ``failure |
48 next state, in which case we need to raise a kind of ``failure |
49 exception''. That means actually we have \emph{partial} functions as |
49 exception''. That means actually we have \emph{partial} functions as |
50 transitions---see the Scala implementation of DFAs later on. A |
50 transitions---see the Scala implementation for DFAs later on. A |
51 typical example of a DFA is |
51 typical example of a DFA is |
52 |
52 |
53 \begin{center} |
53 \begin{center} |
54 \begin{tikzpicture}[>=stealth',very thick,auto, |
54 \begin{tikzpicture}[>=stealth',very thick,auto, |
55 every state/.style={minimum size=0pt, |
55 every state/.style={minimum size=0pt, |
92 (Q_4, a) &\rightarrow& Q_4\\ |
92 (Q_4, a) &\rightarrow& Q_4\\ |
93 (Q_4, b) &\rightarrow& Q_4\\ |
93 (Q_4, b) &\rightarrow& Q_4\\ |
94 \end{array} |
94 \end{array} |
95 \] |
95 \] |
96 |
96 |
|
97 \noindent |
|
98 Please check that this table represents the same transition function |
|
99 as the graph above. |
|
100 |
97 We need to define the notion of what language is accepted by |
101 We need to define the notion of what language is accepted by |
98 an automaton. For this we lift the transition function |
102 an automaton. For this we lift the transition function |
99 $\delta$ from characters to strings as follows: |
103 $\delta$ from characters to strings as follows: |
100 |
104 |
101 \[ |
105 \[ |
233 \end{tikzpicture} |
237 \end{tikzpicture} |
234 \end{center} |
238 \end{center} |
235 |
239 |
236 \noindent |
240 \noindent |
237 This NFA happens to have only one starting state, but in general there |
241 This NFA happens to have only one starting state, but in general there |
238 could be more. Notice that in state $Q_0$ we might go to state $Q_1$ |
242 could be more than one. Notice that in state $Q_0$ we might go to |
239 \emph{or} to state $Q_2$ when receiving an $a$. Similarly in state |
243 state $Q_1$ \emph{or} to state $Q_2$ when receiving an $a$. Similarly |
240 $Q_1$ and receiving an $a$, we can stay in state $Q_1$ \emph{or} go to |
244 in state $Q_1$ and receiving an $a$, we can stay in state $Q_1$ |
241 $Q_2$. This kind of choice is not allowed with DFAs. The downside of |
245 \emph{or} go to $Q_2$. This kind of choice is not allowed with |
242 this choice in NFAs is that when it comes to deciding whether a string is |
246 DFAs. The downside of this choice in NFAs is that when it comes to |
243 accepted by a NFA we potentially have to explore all possibilities. I |
247 deciding whether a string is accepted by a NFA we potentially have to |
244 let you think which strings the above NFA accepts. |
248 explore all possibilities. I let you think which strings the above NFA |
|
249 accepts. |
245 |
250 |
246 |
251 |
247 There are a number of additional points you should note about |
252 There are a number of additional points you should note about |
248 NFAs. Every DFA is a NFA, but not vice versa. The $\rho$ in NFAs is a |
253 NFAs. Every DFA is a NFA, but not vice versa. The $\rho$ in NFAs is a |
249 transition \emph{relation} (DFAs have a transition function). The |
254 transition \emph{relation} (DFAs have a transition function). The |
528 \end{equation} |
533 \end{equation} |
529 |
534 |
530 \noindent |
535 \noindent |
531 I let you think whether the NFAs can match exactly those strings the |
536 I let you think whether the NFAs can match exactly those strings the |
532 regular expressions can match. To do this translation in code we need |
537 regular expressions can match. To do this translation in code we need |
533 a way to construct states programatically...and as an additional |
538 a way to construct states ``programatically''...and as an additional |
534 constrain Scala needs to recognise that these states are being distinct. |
539 constraint Scala needs to recognise that these states are being distinct. |
535 For this I implemented in Figure~\ref{thompson1} a class |
540 For this I implemented in Figure~\ref{thompson1} a class |
536 \texttt{TState} that includes a counter and a companion object that |
541 \texttt{TState} that includes a counter and a companion object that |
537 increases this counter whenever a new state is created.\footnote{You might |
542 increases this counter whenever a new state is created.\footnote{You might |
538 have to read up what \emph{companion objects} do in Scala.} |
543 have to read up what \emph{companion objects} do in Scala.} |
539 |
544 |
544 implement a way of how to create new states that are all |
549 implement a way of how to create new states that are all |
545 distinct by virtue of a counter. This counter is |
550 distinct by virtue of a counter. This counter is |
546 increased in the companion object of \texttt{TState} |
551 increased in the companion object of \texttt{TState} |
547 whenever a new state is created. The code in Lines 24--40 |
552 whenever a new state is created. The code in Lines 24--40 |
548 constructs NFAs for the simple regular expressions $\ZERO$, $\ONE$ and $c$. |
553 constructs NFAs for the simple regular expressions $\ZERO$, $\ONE$ and $c$. |
549 Compare the pictures given in \eqref{simplecases}. |
554 Compare this code with the pictures given in \eqref{simplecases} on |
|
555 Page~\pageref{simplecases}. |
550 \label{thompson1}} |
556 \label{thompson1}} |
551 \end{figure} |
557 \end{figure} |
552 |
558 |
553 \begin{figure}[p] |
559 \begin{figure}[p] |
554 \small |
560 \small |
556 \caption{The second part of the Thompson Construction implementing |
562 \caption{The second part of the Thompson Construction implementing |
557 the composition of NFAs according to $\cdot$, $+$ and ${}^*$. |
563 the composition of NFAs according to $\cdot$, $+$ and ${}^*$. |
558 The implicit class about rich partial functions |
564 The implicit class about rich partial functions |
559 implements the infix operation \texttt{+++} which |
565 implements the infix operation \texttt{+++} which |
560 combines an $\epsilon$NFA transition with a NFA transition |
566 combines an $\epsilon$NFA transition with a NFA transition |
561 (both given as partial functions).\label{thompson2}} |
567 (both are given as partial functions---but with different type!).\label{thompson2}} |
562 \end{figure} |
568 \end{figure} |
563 |
569 |
564 The case for the sequence regular expression $r_1 \cdot r_2$ is a bit more |
570 The case for the sequence regular expression $r_1 \cdot r_2$ is a bit more |
565 complicated: Say, we are given by recursion two NFAs representing the regular |
571 complicated: Say, we are given by recursion two NFAs representing the regular |
566 expressions $r_1$ and $r_2$ respectively. |
572 expressions $r_1$ and $r_2$ respectively. |
656 The case for the alternative regular expression $r_1 + r_2$ is |
662 The case for the alternative regular expression $r_1 + r_2$ is |
657 slightly different: We are given by recursion two NFAs representing |
663 slightly different: We are given by recursion two NFAs representing |
658 $r_1$ and $r_2$ respectively. Each NFA has some starting states and |
664 $r_1$ and $r_2$ respectively. Each NFA has some starting states and |
659 some accepting states. We obtain a NFA for the regular expression $r_1 |
665 some accepting states. We obtain a NFA for the regular expression $r_1 |
660 + r_2$ by composing the transition functions (this crucially depends |
666 + r_2$ by composing the transition functions (this crucially depends |
661 on knowing that the states of each component NFA are distinct); and |
667 on knowing that the states of each component NFA are distinct---recall |
662 also combine the starting states and accepting functions. |
668 we implemented for this to hold some bespoke code for states). We also |
|
669 need to combine the starting states and accepting functions |
|
670 appropriately. |
663 |
671 |
664 \begin{center} |
672 \begin{center} |
665 \begin{tabular}[t]{ccc} |
673 \begin{tabular}[t]{ccc} |
666 \begin{tikzpicture}[node distance=3mm, |
674 \begin{tikzpicture}[node distance=3mm, |
667 >=stealth',very thick, |
675 >=stealth',very thick, |
730 accepting states to a new starting state via |
738 accepting states to a new starting state via |
731 $\epsilon$-transitions. This new starting state is also an accepting |
739 $\epsilon$-transitions. This new starting state is also an accepting |
732 state, because $r^*$ can recognise the empty string. |
740 state, because $r^*$ can recognise the empty string. |
733 |
741 |
734 \begin{center} |
742 \begin{center} |
735 \begin{tabular}[b]{@{\hspace{-4mm}}ccc@{}} |
743 \begin{tabular}[b]{@{}ccc@{}} |
736 \begin{tikzpicture}[node distance=3mm, |
744 \begin{tikzpicture}[node distance=3mm, |
737 >=stealth',very thick, |
745 >=stealth',very thick, |
738 every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20}, |
746 every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20}, |
739 baseline=(current bounding box.north)] |
747 baseline=(current bounding box.north)] |
740 \node at (0,0) (1) {$\mbox{}$}; |
748 \node (2) {$\mbox{}$}; |
741 \node[state, initial] (2) [right=16mm of 1] {$\mbox{}$}; |
749 \node[state, initial] (4) [above=1mm of 2] {$\mbox{}$}; |
|
750 \node[state, initial] (5) [below=1mm of 2] {$\mbox{}$}; |
742 \node (a) [right=of 2] {$\ldots$}; |
751 \node (a) [right=of 2] {$\ldots$}; |
743 \node[state, accepting] (a1) [right=of a] {$\mbox{}$}; |
752 \node[state, accepting] (a1) [right=of a] {$\mbox{}$}; |
744 \node[state, accepting] (a2) [above=of a1] {$\mbox{}$}; |
753 \node[state, accepting] (a2) [above=of a1] {$\mbox{}$}; |
745 \node[state, accepting] (a3) [below=of a1] {$\mbox{}$}; |
754 \node[state, accepting] (a3) [below=of a1] {$\mbox{}$}; |
746 \begin{pgfonlayer}{background} |
755 \begin{pgfonlayer}{background} |
754 \begin{tikzpicture}[node distance=3mm, |
763 \begin{tikzpicture}[node distance=3mm, |
755 >=stealth',very thick, |
764 >=stealth',very thick, |
756 every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20}, |
765 every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20}, |
757 baseline=(current bounding box.north)] |
766 baseline=(current bounding box.north)] |
758 \node at (0,0) [state, initial,accepting] (1) {$\mbox{}$}; |
767 \node at (0,0) [state, initial,accepting] (1) {$\mbox{}$}; |
759 \node[state] (2) [right=16mm of 1] {$\mbox{}$}; |
768 \node (2) [right=16mm of 1] {$\mbox{}$}; |
|
769 \node[state] (4) [above=1mm of 2] {$\mbox{}$}; |
|
770 \node[state] (5) [below=1mm of 2] {$\mbox{}$}; |
760 \node (a) [right=of 2] {$\ldots$}; |
771 \node (a) [right=of 2] {$\ldots$}; |
761 \node[state] (a1) [right=of a] {$\mbox{}$}; |
772 \node[state] (a1) [right=of a] {$\mbox{}$}; |
762 \node[state] (a2) [above=of a1] {$\mbox{}$}; |
773 \node[state] (a2) [above=of a1] {$\mbox{}$}; |
763 \node[state] (a3) [below=of a1] {$\mbox{}$}; |
774 \node[state] (a3) [below=of a1] {$\mbox{}$}; |
764 \path[->] (1) edge node [above] {$\epsilon$} (2); |
775 \path[->] (1) edge node [below] {$\epsilon$} (4); |
765 \path[->] (a1) edge [bend left=45] node [above] {$\epsilon$} (1); |
776 \path[->] (1) edge node [below] {$\epsilon$} (5); |
|
777 \path[->] (a1) edge [bend left=45] node [below] {$\epsilon$} (1); |
766 \path[->] (a2) edge [bend right] node [below] {$\epsilon$} (1); |
778 \path[->] (a2) edge [bend right] node [below] {$\epsilon$} (1); |
767 \path[->] (a3) edge [bend left=45] node [below] {$\epsilon$} (1); |
779 \path[->] (a3) edge [bend left=45] node [below] {$\epsilon$} (1); |
768 \begin{pgfonlayer}{background} |
780 \begin{pgfonlayer}{background} |
769 \node (2) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (1) (a2) (a3)] {}; |
781 \node (2) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (1) (a2) (a3)] {}; |
770 \node [yshift=3mm] at (2.north) {$r^*$}; |
782 \node [yshift=3mm] at (2.north) {$r^*$}; |
1074 \texttt{nexts} from the NFA class already calculates this for us. The |
1086 \texttt{nexts} from the NFA class already calculates this for us. The |
1075 accepting-states function for the DFA is true henevner at least one |
1087 accepting-states function for the DFA is true henevner at least one |
1076 state in the subset is accepting (that is true) in the NFA.\medskip |
1088 state in the subset is accepting (that is true) in the NFA.\medskip |
1077 |
1089 |
1078 \noindent |
1090 \noindent |
1079 You might be able to spend some quality tinkering with this code and |
1091 You might be able to spend some quality time tinkering with this code |
1080 time to ponder about it. Then you will probably notice it is actually |
1092 and time to ponder about it. Then you will probably notice that it is |
1081 a bit silly. The whole point of translating the NFA into a DFA via the |
1093 actually a bit silly. The whole point of translating the NFA into a |
1082 subset construction is to make the decision of whether a string is |
1094 DFA via the subset construction is to make the decision of whether a |
1083 accepted or not faster. Given the code above, the generated DFA will |
1095 string is accepted or not faster. Given the code above, the generated |
1084 be exactly as fast, or as slow, as the NFA we started with (actually |
1096 DFA will be exactly as fast, or as slow, as the NFA we started with |
1085 it will even be a tiny bit slower). The reason is that we just re-use |
1097 (actually it will even be a tiny bit slower). The reason is that we |
1086 the \texttt{nexts} function from the NFA. This function implements the |
1098 just re-use the \texttt{nexts} function from the NFA. This function |
1087 non-deterministic breadth-first search. You might be thinking: This |
1099 implements the non-deterministic breadth-first search. You might be |
1088 is cheating! \ldots{} Well, not quite as you will see later, but in |
1100 thinking: This is cheating! \ldots{} Well, not quite as you will see |
1089 terms of speed we still need to work a bit in order to get |
1101 later, but in terms of speed we still need to work a bit in order to |
1090 sometimes(!) a faster DFA. Let's do this next. |
1102 get sometimes(!) a faster DFA. Let's do this next. |
1091 |
1103 |
1092 \subsection*{DFA Minimisation} |
1104 \subsection*{DFA Minimisation} |
1093 |
1105 |
1094 As seen in \eqref{subsetdfa}, the subset construction from NFA to a |
1106 As seen in \eqref{subsetdfa}, the subset construction from NFA to a |
1095 DFA can result in a rather ``inefficient'' DFA. Meaning there are |
1107 DFA can result in a rather ``inefficient'' DFA. Meaning there are |
1096 states that are not needed. There are two kinds of such unneeded |
1108 states that are not needed. There are two kinds of such unneeded |
1097 states: \emph{unreachable} states and \emph{nondistinguishable} |
1109 states: \emph{unreachable} states and \emph{non-distinguishable} |
1098 states. The first kind of states can just be removed without affecting |
1110 states. The first kind of states can just be removed without affecting |
1099 the language that can be recognised (after all they are |
1111 the language that can be recognised (after all they are |
1100 unreachable). The second kind can also be recognised and thus a DFA |
1112 unreachable). The second kind can also be recognised and thus a DFA |
1101 can be \emph{minimised} by the following algorithm: |
1113 can be \emph{minimised} by the following algorithm: |
1102 |
1114 |
1251 the functions would be thrown away. So the compromise is to not being |
1263 the functions would be thrown away. So the compromise is to not being |
1252 able to minimise (easily) our DFAs. |
1264 able to minimise (easily) our DFAs. |
1253 |
1265 |
1254 \subsection*{Brzozowski's Method} |
1266 \subsection*{Brzozowski's Method} |
1255 |
1267 |
1256 I know tyhis is already a long, long rant: but after all it is a topic |
1268 I know this handout is already a long, long rant: but after all it is |
1257 that has been researched for more than 60 years. If you reflect on |
1269 a topic that has been researched for more than 60 years. If you |
1258 what you have read so far, the story is that you can take a regular |
1270 reflect on what you have read so far, the story is that you can take a |
1259 expression, translate it via the Thompson Construction into an |
1271 regular expression, translate it via the Thompson Construction into an |
1260 $\epsilon$NFA, then translate it into a NFA by removing all |
1272 $\epsilon$NFA, then translate it into a NFA by removing all |
1261 $\epsilon$-transitions, and then via the subset construction obtain a |
1273 $\epsilon$-transitions, and then via the subset construction obtain a |
1262 DFA. In all steps we made sure the language, or which strings can be |
1274 DFA. In all steps we made sure the language, or which strings can be |
1263 recognised, stays the same. After the last section, we can even |
1275 recognised, stays the same. Of couse we should have proved this in |
1264 minimise the DFA (maybe not in code). But again we made sure the same |
1276 each step, but let us cut corners here. After the last section, we |
1265 language is recognised. You might be wondering: Can we go into the |
1277 can even minimise the DFA (maybe not in code). But again we made sure |
1266 other direction? Can we go from a DFA and obtain a regular expression |
1278 the same language is recognised. You might be wondering: Can we go |
1267 that can recognise the same language as the DFA?\medskip |
1279 into the other direction? Can we go from a DFA and obtain a regular |
|
1280 expression that can recognise the same language as the DFA?\medskip |
1268 |
1281 |
1269 \noindent |
1282 \noindent |
1270 The answer is yes. Again there are several methods for calculating a |
1283 The answer is yes. Again there are several methods for calculating a |
1271 regular expression for a DFA. I will show you Brzozowski's method |
1284 regular expression for a DFA. I will show you Brzozowski's method |
1272 because it calculates a regular expression using quite familiar |
1285 because it calculates a regular expression using quite familiar |
1518 module. |
1531 module. |
1519 |
1532 |
1520 |
1533 |
1521 \subsection*{Where Have Derivatives Gone?} |
1534 \subsection*{Where Have Derivatives Gone?} |
1522 |
1535 |
1523 By now you are probably fed up by this text. It is now way too long |
1536 By now you are probably fed up with this text. It is now way too long |
1524 for one lecture, but there is still one aspect of the |
1537 for one lecture, but there is still one aspect of the |
1525 automata-connection I like to highlight for you. Perhaps by now you |
1538 automata-regular-expression-connection I like to describe. Perhaps by |
1526 are asking yourself: Where have the derivatives gone? Did we just |
1539 now you are asking yourself: Where have the derivatives gone? Did we |
1527 forget them? Well, they have a place in the picture of calculating a |
1540 just forget them? Well, they have a place in the picture of |
1528 DFA from the regular expression. |
1541 calculating a DFA from the regular expression. |
1529 |
1542 |
1530 To be done |
1543 To be done |
1531 |
1544 |
|
1545 \begin{center} |
|
1546 \begin{tikzpicture} |
|
1547 [level distance=25mm,very thick,auto, |
|
1548 level 1/.style={sibling distance=30mm}, |
|
1549 level 2/.style={sibling distance=15mm}, |
|
1550 every node/.style={minimum size=30pt, |
|
1551 inner sep=0pt,circle,draw=blue!50,very thick, |
|
1552 fill=blue!20}] |
|
1553 \node {$r$} [grow=right] |
|
1554 child[->] {node (cn) {$d_{c_n}$} |
|
1555 child { node {$dd_{c_nc_n}$}} |
|
1556 child { node {$dd_{c_nc_1}$}} |
|
1557 %edge from parent node[left] {$c_n$} |
|
1558 } |
|
1559 child[->] {node (c1) {$d_{c_1}$} |
|
1560 child { node {$dd_{c_1c_n}$}} |
|
1561 child { node {$dd_{c_1c_1}$}} |
|
1562 %edge from parent node[left] {$c_1$} |
|
1563 }; |
|
1564 %%\draw (cn) -- (c1) node {\vdots}; |
|
1565 \end{tikzpicture} |
|
1566 \end{center} |
1532 |
1567 |
1533 %\section*{Further Reading} |
1568 %\section*{Further Reading} |
1534 |
1569 |
1535 %Compare what a ``human expert'' would create as an automaton for the |
1570 %Compare what a ``human expert'' would create as an automaton for the |
1536 %regular expression $a\cdot (b + c)^*$ and what the Thomson |
1571 %regular expression $a\cdot (b + c)^*$ and what the Thomson |