--- a/handouts/ho03.tex Tue May 09 12:31:55 2017 +0100
+++ b/handouts/ho03.tex Wed May 10 17:03:21 2017 +0100
@@ -465,7 +465,13 @@
for a ``proper'' transition consuming a character. The functions in
Lines 18--26 calculate
all states reachable by one or more $\epsilon$-transition for a given
- set of states. The NFA is constructed in Lines 36--38.\label{enfa}}
+ set of states. The NFA is constructed in Lines 36--38.
+ Note the interesting commands in Lines 5 and 6: their purpose is
+ to ensure that \texttt{fixpT} is the tail-recursive version of
+ the fixpoint construction; otherwise we would quickly get a
+ stack-overflow exception, even on small examples, due to limitations
+ of the JVM.
+ \label{enfa}}
\end{figure}
Also look carefully how the transitions of $\epsilon$NFAs are
@@ -886,16 +892,17 @@
\subsection*{Subset Construction}
-Of course, some developers of regular expression matchers are aware
-of these problems with sluggish NFAs and try to address them. One
-common technique for this I like to show you in this section. It will
-also explain why we insisted on polymorphic types in our DFA code
-(remember I used \texttt{A} and \texttt{C} for the types of states and
-the input, see Figure~\ref{dfa} on Page~\pageref{dfa}).\bigskip
+Of course, some developers of regular expression matchers are aware of
+these problems with sluggish NFAs and try to address them. One common
+technique for alleviating the problem I like to show you in this
+section. This will also explain why we insisted on polymorphic types in
+our DFA code (remember I used \texttt{A} and \texttt{C} for the types
+of states and the input, see Figure~\ref{dfa} on
+Page~\pageref{dfa}).\bigskip
\noindent
-To start, remember that we did not bother with defining and
-implementing $\epsilon$NFA; we immediately translated them into
+To start remember that we did not bother with defining and
+implementing $\epsilon$NFAs: we immediately translated them into
equivalent NFAs. Equivalent in the sense of accepting the same
language (though we only claimed this and did not prove it
rigorously). Remember also that NFAs have non-deterministic
@@ -903,20 +910,20 @@
sets of states. This non-determinism is crucial for the Thompson
Construction to work (recall the cases for $\cdot$, $+$ and
${}^*$). But this non-determinism makes it harder with NFAs to decide
-when a string is accepted or not; such a decision is rather
+when a string is accepted or not; whereas such a decision is rather
straightforward with DFAs: recall their transition function is a
-\emph{function} that returns a single state. So we do not have to
-search at all. What is perhaps interesting is the fact that for every
-NFA we can find a DFA that also recognises the same language. This
-might sound a bit paradoxical: NFA $\rightarrow$ decision of
-acceptance hard; DFA $\rightarrow$ decision easy. But this \emph{is}
-true\ldots but of course there is always a caveat---nothing is ever
-for free in life.
+\emph{function} that returns a single state. So with DFAs we do not
+have to search at all. What is perhaps interesting is the fact that
+for every NFA we can find a DFA that also recognises the same
+language. This might sound a bit paradoxical: NFA $\rightarrow$
+decision of acceptance hard; DFA $\rightarrow$ decision easy. But this
+\emph{is} true\ldots but of course there is always a caveat---nothing
+ever is for free in life.
-There are a number of techniques for transforming a NFA into an
-equivalent DFA, but the most famous one is the \emph{subset
+There are actually a number of methods for transforming a NFA into
+an equivalent DFA, but the most famous one is the \emph{subset
construction}. Consider the following NFA where the states are
-labelled with, say, $0$, $1$ and $2$.
+labelled with $0$, $1$ and $2$.
\begin{center}
\begin{tabular}{c@{\hspace{10mm}}c}
@@ -949,9 +956,9 @@
\end{center}
\noindent The states of the corresponding DFA are given by generating
-all subsets of the set of states of the NFA (seen in the states column
+all subsets of the set $\{0,1,2\}$ (seen in the states column
in the table on the right). The other columns define the transition
-function for the DFA for input $a$ and $b$. The first row states that
+function for the DFA for inputs $a$ and $b$. The first row states that
$\{\}$ is the sink state which has transitions for $a$ and $b$ to
itself. The next three lines are calculated as follows:
@@ -964,29 +971,32 @@
$1$ in the NFA; therefore in the DFA we can go from state $\{0\}$ to
state $\{0,1\}$ via an $b$-transition.
\item Continue with the states $\{1\}$ and $\{2\}$.
-\item Once you filled in the transitions for `simple' state, you only
- have to build the union for the compound states $\{0,1\}$, $\{0,2\}$
- and so on. For example for $\{0,1\}$ you take the union of line
- $\{0\}$ and line $\{1\}$, which gives $\{0,2\}$ for $a$, and
- $\{0,1,2\}$ for $b$. And so on.
-\item The starting state of the DFA can be calculated from the
- starting states of the NFA, that is in this case $0$. But in general
- there can be many starting states in the NFA and you would take the
- corresponding subset as \emph{the} starting state of the DFA.
-\item The accepting states in the DFA are given by all sets that
- contain a $2$, which is the only accpting state in this NFA. But
- again in general if the subset contains an accepting state from the
- NFA, then the corresponding state in the DFA is accepting as well.
\end{itemize}
-\noindent This completes the subset construction. The corresponding
-DFA for the NFA shown above is:
+\noindent
+Once you filled in the transitions for `simple' states $\{0\}$
+.. $\{2\}$, you only have to build the union for the compound states
+$\{0,1\}$, $\{0,2\}$ and so on. For example for $\{0,1\}$ you take the
+union of Line $\{0\}$ and Line $\{1\}$, which gives $\{0,2\}$ for $a$,
+and $\{0,1,2\}$ for $b$. And so on.
-\begin{center}
+The starting state of the DFA can be calculated from the starting
+states of the NFA, that is in this case $\{0\}$. But in general there
+can of course be many starting states in the NFA and you would take
+the corresponding subset as \emph{the} starting state of the DFA.
+
+The accepting states in the DFA are given by all sets that contain a
+$2$, which is the only accpting state in this NFA. But again in
+general if the subset contains any accepting state from the NFA, then
+the corresponding state in the DFA is accepting as well. This
+completes the subset construction. The corresponding DFA for the NFA
+shown above is:
+
+\begin{equation}
\begin{tikzpicture}[scale=0.8,>=stealth',very thick,
every state/.style={minimum size=0pt,
- draw=blue!50,very thick,fill=blue!20},
- baseline=0mm]
+ draw=blue!50,very thick,fill=blue!20},
+ baseline=(current bounding box.center)]
\node[state,initial] (q0) {$0$};
\node[state] (q01) [right=of q0] {$0,1$};
\node[state,accepting] (q02) [below=of q01] {$0,2$};
@@ -1008,34 +1018,39 @@
\path[->] (q12) edge node [right] {$a, b$} (q2);
\path[->] (q2) edge node [right] {$a, b$} (qn);
\path[->] (qn) edge [loop left] node {$a,b$} ();
-\end{tikzpicture}
-\end{center}
+\end{tikzpicture}\label{subsetdfa}
+\end{equation}
\noindent
Please check that this is indeed a DFA. The big question is whether
-this DFA can recognise the same language as the NFA we started with.
+this DFA can recognise the same language as the NFA we started with?
I let you ponder about this question.
-There are also two points to note: One is that very often the
-resulting DFA contains a number of ``dead'' states that are never
-reachable from the starting state. This is obvious in this case, where
-state $\{1\}$, $\{2\}$, $\{1,2\}$ and $\{\}$ can never be reached from
-the starting state. In effect the DFA in this example is not a
-\emph{minimal} DFA (more about this in a minute). Such dead states can
-be safely removed without changing the language that is recognised by
-the DFA. Another point is that in some cases, however, the subset
-construction produces a DFA that does \emph{not} contain any dead
-states\ldots{}and further calculates a minimal DFA. Which in turn
-means that in some cases the number of states can by going from NFAs
-to DFAs exponentially increase, namely by $2^n$ (which is the number
-of subsets you can form for $n$ states). This blow up the number of
-states in the DFA is again bad news for how quickly you can decide
-whether a string is accepted by a DFA or not. So the caveat with DFAs
-is that they might make the task of finding the next state trival, but
-might require $2^n$ times as many states as a NFA.\bigskip
+There are also two points to note: One is that very often in the
+subset construction the resulting DFA contains a number of ``dead''
+states that are never reachable from the starting state. This is
+obvious in the example, where state $\{1\}$, $\{2\}$, $\{1,2\}$ and
+$\{\}$ can never be reached from the starting state. But this might
+not always be as obvious as that. In effect the DFA in this example is
+not a \emph{minimal} DFA (more about this in a minute). Such dead
+states can be safely removed without changing the language that is
+recognised by the DFA. Another point is that in some cases, however,
+the subset construction produces a DFA that does \emph{not} contain
+any dead states\ldots{}this means it calculates a minimal DFA. Which
+in turn means that in some cases the number of states can by going
+from NFAs to DFAs exponentially increase, namely by $2^n$ (which is
+the number of subsets you can form for sets of $n$ states). This blow
+up in the number of states in the DFA is again bad news for how
+quickly you can decide whether a string is accepted by a DFA or
+not. So the caveat with DFAs is that they might make the task of
+finding the next state trival, but might require $2^n$ times as many
+states then a NFA.\bigskip
-Lastly, can we
+\noindent
+To conclude this section, how conveniently we can
+implement the subset construction with our versions of NFAs and
+DFAs? Very conveninetly. The code is just:
{\small\begin{lstlisting}[language=Scala]
def subset[A, C](nfa: NFA[A, C]) : DFA[Set[A], C] = {
@@ -1045,14 +1060,45 @@
}
\end{lstlisting}}
+\noindent
+The interesting point in this code is that the state type of the
+calculated DFA is \texttt{Set[A]}. Think carefully that this works out
+correctly.
+The DFA is then given by three components: the starting states, the
+transition function and the accepting-states function. The starting
+states are a set in the given NFA, but a single state in the DFA. The
+transition function, given the state \texttt{qs} and input \texttt{c},
+needs to produce the next state: this is the set of all NFA states
+that are reachable from each state in \texttt{qs}. The function
+\texttt{nexts} from the NFA class already calculates this for us. The
+accepting-states function for the DFA is true henevner at least one
+state in the subset is accepting (that is true) in the NFA.\medskip
+
+\noindent
+You might be able to spend some quality tinkering with this code and
+time to ponder. Then you will probably notice it is actually
+silly. The whole point of translating the NFA into a DFA via the
+subset construction is to make the decision of whether a string is
+accepted or not faster. Given the code above, the generated DFA will
+be exactly as fast, or as slow, as the NFA we started with (actually
+it will even be a tiny bit slower). The reason is that we just re-use
+the \texttt{nexts} function from the NFA. This fucntion implements the
+non-deterministic breadth-first search. You might be thinking: That
+is cheating! \ldots{} Well, not quite as you will see later, but in
+terms of speed we still need to work a bit in order to get
+sometimes(!) a faster DFA. Let's do this next.
\subsection*{DFA Minimisation}
-As seen in the subset construction, the translation
-of a NFA to a DFA can result in a rather ``inefficient''
-DFA. Meaning there are states that are not needed. A
-DFA can be \emph{minimised} by the following algorithm:
+As seen in \eqref{subsetdfa}, the subset construction from NFA to a
+DFA can result in a rather ``inefficient'' DFA. Meaning there are
+states that are not needed. There are two kinds of such unneeded
+states: \emph{unreachable} states and \emph{nondistinguishable}
+states. The first kind of states can just be removed without affecting
+the language that can be recognised (after all they are
+unreachable). The second kind can also be recognised and thus a DFA
+can be \emph{minimised} by the following algorithm:
\begin{enumerate}
\item Take all pairs $(q, p)$ with $q \not= p$
@@ -1069,8 +1115,9 @@
\item All unmarked pairs can be merged.
\end{enumerate}
-\noindent To illustrate this algorithm, consider the following
-DFA.
+\noindent Unfortunately, once we throw away all unreachable states in
+\eqref{subsetdfa}, all remaining states are needed. In order to
+illustrate the minimisation algorithm, consider the following DFA.
\begin{center}
\begin{tikzpicture}[>=stealth',very thick,auto,
@@ -1134,7 +1181,7 @@
accepting ($Q_4$) and a state that is non-accepting (the other
states).
-Now in Step 3 we need to fill in more stars according whether
+In Step 3 we need to fill in more stars according whether
one of the next-state pairs are marked. We have to do this
for every unmarked field until there is no change anymore.
This gives the triangle
@@ -1197,10 +1244,24 @@
\subsection*{Brzozowski's Method}
-As said before, we can also go into the other direction---from
-DFAs to regular expressions. Brzozowski's method calculates
-a regular expression using familiar transformations for
-solving equational systems. Consider the DFA:
+I know tyhis is already a long, long rant: but after all it is a topic
+that has been researched for more than 60 years. If you reflect on
+what you have read so far, the story you can take a regular
+expression, translate it via the Thompson Construction into an
+$\epsilon$NFA, then translate it into a NFA by removing all
+$\epsilon$-transitions, and then via the subset construction obtain a
+DFA. In all steps we made sure the language, or which strings can be
+recognised, stays the same. After the last section, we can even
+minimise the DFA. But again we made sure the same language is
+recognised. You might be wondering: Can we go into the other
+direction? Can we go from a DFA and obtain a regular expression that
+can recognise the same language as the DFA?\medskip
+
+\noindent
+The answer is yes. Again there are several methods for calculating a
+regular expression for a DFA. I will show you Brzozowski's method
+because it calculates a regular expression using quite familiar
+transformations for solving equational systems. Consider the DFA:
\begin{center}
\begin{tikzpicture}[scale=1.5,>=stealth',very thick,auto,
@@ -1329,21 +1390,20 @@
\noindent Finally, we only need to ``add'' up the equations
which correspond to a terminal state. In our running example,
this is just $Q_2$. Consequently, a regular expression
-that recognises the same language as the automaton is
+that recognises the same language as the DFA is
\[
(b + a\,b + a\,a\,(a^*)\,b)^*\,a\,a\,(a)^*
\]
-\noindent You can somewhat crosscheck your solution
-by taking a string the regular expression can match and
-and see whether it can be matched by the automaton.
-One string for example is $aaa$ and \emph{voila} this
+\noindent You can somewhat crosscheck your solution by taking a string
+the regular expression can match and and see whether it can be matched
+by the DFA. One string for example is $aaa$ and \emph{voila} this
string is also matched by the automaton.
-We should prove that Brzozowski's method really produces
-an equivalent regular expression for the automaton. But
-for the purposes of this module, we omit this.
+We should prove that Brzozowski's method really produces an equivalent
+regular expression. But for the purposes of this module, we omit
+this. I guess you are relieved.
\subsection*{Regular Languages}
@@ -1372,7 +1432,7 @@
there exists a regular expression that can recognise the same
language. Again we did not prove this fact.
-The interesting conclusion is that automata and regular
+The fundamental conclusion we can draw is that automata and regular
expressions can recognise the same set of languages:
\begin{quote} A language is \emph{regular} iff there exists a
@@ -1385,70 +1445,68 @@
automaton that recognises all its strings.
\end{quote}
-\noindent So for deciding whether a string is recognised by a
-regular expression, we could use our algorithm based on
-derivatives or NFAs or DFAs. But let us quickly look at what
-the differences mean in computational terms. Translating a
-regular expression into a NFA gives us an automaton that has
-$O(n)$ states---that means the size of the NFA grows linearly
-with the size of the regular expression. The problem with NFAs
-is that the problem of deciding whether a string is accepted
-or not is computationally not cheap. Remember with NFAs we
-have potentially many next states even for the same input and
-also have the silent $\epsilon$-transitions. If we want to
-find a path from the starting state of a NFA to an accepting
-state, we need to consider all possibilities. In Ruby and
-Python this is done by a depth-first search, which in turn
-means that if a ``wrong'' choice is made, the algorithm has to
-backtrack and thus explore all potential candidates. This is
-exactly the reason why Ruby and Python are so slow for evil
-regular expressions. An alternative to the potentially slow
-depth-first search is to explore the search space in a
-breadth-first fashion, but this might incur a big memory
-penalty.
+\noindent Note that this is not a stement for a particular language
+(that is a particular set of strings), but about a large class of
+languages, namely the regular ones.
-To avoid the problems with NFAs, we can translate them
-into DFAs. With DFAs the problem of deciding whether a
-string is recognised or not is much simpler, because in
-each state it is completely determined what the next
-state will be for a given input. So no search is needed.
-The problem with this is that the translation to DFAs
-can explode exponentially the number of states. Therefore when
-this route is taken, we definitely need to minimise the
-resulting DFAs in order to have an acceptable memory
-and runtime behaviour. But remember the subset construction
-in the worst case explodes the number of states by $2^n$.
-Effectively also the translation to DFAs can incur a big
+As a consequence for deciding whether a string is recognised by a
+regular expression, we could use our algorithm based on derivatives or
+NFAs or DFAs. But let us quickly look at what the differences mean in
+computational terms. Translating a regular expression into a NFA gives
+us an automaton that has $O(n)$ states---that means the size of the
+NFA grows linearly with the size of the regular expression. The
+problem with NFAs is that the problem of deciding whether a string is
+accepted or not is computationally not cheap. Remember with NFAs we
+have potentially many next states even for the same input and also
+have the silent $\epsilon$-transitions. If we want to find a path from
+the starting state of a NFA to an accepting state, we need to consider
+all possibilities. In Ruby, Python and Java this is done by a
+depth-first search, which in turn means that if a ``wrong'' choice is
+made, the algorithm has to backtrack and thus explore all potential
+candidates. This is exactly the reason why Ruby, Python and Java are
+so slow for evil regular expressions. An alternative to the
+potentially slow depth-first search is to explore the search space in
+a breadth-first fashion, but this might incur a big memory penalty.
+
+To avoid the problems with NFAs, we can translate them into DFAs. With
+DFAs the problem of deciding whether a string is recognised or not is
+much simpler, because in each state it is completely determined what
+the next state will be for a given input. So no search is needed. The
+problem with this is that the translation to DFAs can explode
+exponentially the number of states. Therefore when this route is
+taken, we definitely need to minimise the resulting DFAs in order to
+have an acceptable memory and runtime behaviour. But remember the
+subset construction in the worst case explodes the number of states by
+$2^n$. Effectively also the translation to DFAs can incur a big
runtime penalty.
-But this does not mean that everything is bad with automata.
-Recall the problem of finding a regular expressions for the
-language that is \emph{not} recognised by a regular
-expression. In our implementation we added explicitly such a
-regular expressions because they are useful for recognising
-comments. But in principle we did not need to. The argument
-for this is as follows: take a regular expression, translate
+But this does not mean that everything is bad with automata. Recall
+the problem of finding a regular expressions for the language that is
+\emph{not} recognised by a regular expression. In our implementation
+we added explicitly such a regular expressions because they are useful
+for recognising comments. But in principle we did not need to. The
+argument for this is as follows: take a regular expression, translate
it into a NFA and then a DFA that both recognise the same
-language. Once you have the DFA it is very easy to construct
-the automaton for the language not recognised by a DFA. If
-the DFA is completed (this is important!), then you just need
-to exchange the accepting and non-accepting states. You can
-then translate this DFA back into a regular expression and
-that will be the regular expression that can match all strings
-the original regular expression could \emph{not} match.
+language. Once you have the DFA it is very easy to construct the
+automaton for the language not recognised by a DFA. If the DFA is
+completed (this is important!), then you just need to exchange the
+accepting and non-accepting states. You can then translate this DFA
+back into a regular expression and that will be the regular expression
+that can match all strings the original regular expression could
+\emph{not} match.
-It is also interesting that not all languages are regular. The
-most well-known example of a language that is not regular
-consists of all the strings of the form
+It is also interesting that not all languages are regular. The most
+well-known example of a language that is not regular consists of all
+the strings of the form
\[a^n\,b^n\]
-\noindent meaning strings that have the same number of $a$s
-and $b$s. You can try, but you cannot find a regular
-expression for this language and also not an automaton. One
-can actually prove that there is no regular expression nor
-automaton for this language, but again that would lead us too
-far afield for what we want to do in this module.
+\noindent meaning strings that have the same number of $a$s and
+$b$s. You can try, but you cannot find a regular expression for this
+language and also not an automaton. One can actually prove that there
+is no regular expression nor automaton for this language, but again
+that would lead us too far afield for what we want to do in this
+module.
%\section*{Further Reading}