afl-material: comparison handouts/ho03.tex


-:4fee50f38305
+:d5776c6018f0
 transition function of $\epsilon$NFA takes as input an \texttt{Option[C]}.
 \texttt{None} stands for an $\epsilon$-transition; \texttt{Some(c)}
 for a ``proper'' transition consuming a character. The functions in
 Lines 18--26 calculate
 all states reachable by one or more $\epsilon$-transitions for a given
-set of states. The NFA is constructed in Lines 36--38.\label{enfa}}
+set of states. The NFA is constructed in Lines 36--38.
+Note the interesting commands in Lines 5 and 6: their purpose is
+to ensure that \texttt{fixpT} is the tail-recursive version of
+the fixpoint construction; otherwise we would quickly get a
+stack-overflow exception, even on small examples, due to limitations
+of the JVM.
+\label{enfa}}
 \end{figure}
 Also look carefully at how the transitions of $\epsilon$NFAs are
 implemented. The additional possibility of performing silent
 transitions is encoded by using \texttt{Option[C]} as the type for the
 \subsection*{Subset Construction}
-Of course, some developers of regular expression matchers are aware
+Of course, some developers of regular expression matchers are aware of
-of these problems with sluggish NFAs and try to address them. One
+these problems with sluggish NFAs and try to address them. One common
-common technique for this I like to show you in this section. It will
+technique for alleviating the problem I would like to show you in this
-also explain why we insisted on polymorphic types in our DFA code
+section. This will also explain why we insisted on polymorphic types in
-(remember I used \texttt{A} and \texttt{C} for the types of states and
+our DFA code (remember I used \texttt{A} and \texttt{C} for the types
-the input, see Figure~\ref{dfa} on Page~\pageref{dfa}).\bigskip
+of states and the input, see Figure~\ref{dfa} on
+Page~\pageref{dfa}).\bigskip
 \noindent
-To start, remember that we did not bother with defining and
+To start, remember that we did not bother with defining and
-implementing $\epsilon$NFA; we immediately translated them into
+implementing $\epsilon$NFAs: we immediately translated them into
 equivalent NFAs. Equivalent in the sense of accepting the same
 language (though we only claimed this and did not prove it
 rigorously). Remember also that NFAs have non-deterministic
 transitions defined as a relation or implemented as a function returning
 sets of states.  This non-determinism is crucial for the Thompson
 Construction to work (recall the cases for $\cdot$, $+$ and
 ${}^*$). But this non-determinism makes it harder with NFAs to decide
-when a string is accepted or not; such a decision is rather
+when a string is accepted or not; whereas such a decision is rather
 straightforward with DFAs: recall their transition function is a
-\emph{function} that returns a single state. So we do not have to
+\emph{function} that returns a single state. So with DFAs we do not
-search at all.  What is perhaps interesting is the fact that for every
+have to search at all.  What is perhaps interesting is the fact that
-NFA we can find a DFA that also recognises the same language. This
+for every NFA we can find a DFA that also recognises the same
-might sound a bit paradoxical: NFA $\rightarrow$ decision of
+language. This might sound a bit paradoxical: NFA $\rightarrow$
-acceptance hard; DFA $\rightarrow$ decision easy. But this \emph{is}
+decision of acceptance hard; DFA $\rightarrow$ decision easy. But this
-true\ldots but of course there is always a caveat---nothing is ever
+\emph{is} true\ldots but of course there is always a caveat---nothing
-for free in life.
+ever is for free in life.
-There are a number of techniques for transforming a NFA into an
+There are actually a number of methods for transforming a NFA into
-equivalent DFA, but the most famous one is the \emph{subset
+an equivalent DFA, but the most famous one is the \emph{subset
 construction}. Consider the following NFA where the states are
-labelled with, say, $0$, $1$ and $2$.
+labelled with $0$, $1$ and $2$.
 \begin{center}
 \begin{tabular}{c@{\hspace{10mm}}c}
 \begin{tikzpicture}[scale=0.7,>=stealth',very thick,
 every state/.style={minimum size=0pt,
 \end{tabular}
 \end{tabular}
 \end{center}
 \noindent The states of the corresponding DFA are given by generating
-all subsets of the set of states of the NFA (seen in the states column
+all subsets of the set $\{0,1,2\}$ (seen in the states column
 in the table on the right). The other columns define the transition
-function for the DFA for input $a$ and $b$. The first row states that
+function for the DFA for inputs $a$ and $b$. The first row states that
 $\{\}$ is the sink state which has transitions for $a$ and $b$ to
 itself.  The next three lines are calculated as follows:
 \begin{itemize}
 \item Suppose you calculate the entry for the $a$-transition for state
 state $\{0\}$ we can go to state $\{0\}$ via an $a$-transition.
 \item Do the same for the $b$-transition; you can reach states $0$ and
 $1$ in the NFA; therefore in the DFA we can go from state $\{0\}$ to
 state $\{0,1\}$ via a $b$-transition.
 \item Continue with the states $\{1\}$ and $\{2\}$.
-\item Once you filled in the transitions for `simple' state, you only
-have to build the union for the compound states $\{0,1\}$, $\{0,2\}$
-and so on. For example for $\{0,1\}$ you take the union of line
-$\{0\}$ and line $\{1\}$, which gives $\{0,2\}$ for $a$, and
-$\{0,1,2\}$ for $b$. And so on.
-\item The starting state of the DFA can be calculated from the
-starting states of the NFA, that is in this case $0$. But in general
-there can be many starting states in the NFA and you would take the
-corresponding subset as \emph{the} starting state of the DFA.
-\item The accepting states in the DFA are given by all sets that
-contain a $2$, which is the only accpting state in this NFA. But
-again in general if the subset contains an accepting state from the
-NFA, then the corresponding state in the DFA is accepting as well.
 \end{itemize}
-\noindent This completes the subset construction. The corresponding
+\noindent
-DFA for the NFA shown above is:
+Once you have filled in the transitions for the `simple' states $\{0\}$
+\ldots{} $\{2\}$, you only have to build the union for the compound states
-\begin{center}
+$\{0,1\}$, $\{0,2\}$ and so on. For example for $\{0,1\}$ you take the
+union of Line $\{0\}$ and Line $\{1\}$, which gives $\{0,2\}$ for $a$,
+and $\{0,1,2\}$ for $b$. And so on.
+The starting state of the DFA can be calculated from the starting
+states of the NFA, that is in this case $\{0\}$. But in general there
+can of course be many starting states in the NFA and you would take
+the corresponding subset as \emph{the} starting state of the DFA.
+The accepting states in the DFA are given by all sets that contain a
+$2$, which is the only accepting state in this NFA. But again in
+general if the subset contains any accepting state from the NFA, then
+the corresponding state in the DFA is accepting as well.  This
+completes the subset construction. The corresponding DFA for the NFA
+shown above is:
+\begin{equation}
 \begin{tikzpicture}[scale=0.8,>=stealth',very thick,
 every state/.style={minimum size=0pt,
 draw=blue!50,very thick,fill=blue!20},
-baseline=0mm]
+baseline=(current bounding box.center)]
 \node[state,initial]  (q0)  {$0$};
 \node[state] (q01) [right=of q0] {$0,1$};
 \node[state,accepting] (q02) [below=of q01] {$0,2$};
 \node[state,accepting] (q012) [right=of q02] {$0,1,2$};
 \node[state] (q1) [below=0.5cm of q0] {$1$};
 \path[->] (q02) edge [bend left] node [right]  {$b$} (q01);
 \path[->] (q1) edge node [left] {$a,b$} (q2);
 \path[->] (q12) edge node [right] {$a, b$} (q2);
 \path[->] (q2) edge node [right] {$a, b$} (qn);
 \path[->] (qn) edge [loop left] node  {$a,b$} ();
-\end{tikzpicture}
+\end{tikzpicture}\label{subsetdfa}
-\end{center}
+\end{equation}
 \noindent
 Please check that this is indeed a DFA. The big question is whether
-this DFA can recognise the same language as the NFA we started with.
+this DFA can recognise the same language as the NFA we started with?
 I let you ponder about this question.
-There are also two points to note: One is that very often the
+There are also two points to note: One is that very often in the
-resulting DFA contains a number of ``dead'' states that are never
+subset construction the resulting DFA contains a number of ``dead''
-reachable from the starting state. This is obvious in this case, where
+states that are never reachable from the starting state. This is
-state $\{1\}$, $\{2\}$, $\{1,2\}$ and $\{\}$ can never be reached from
+obvious in the example, where the states $\{1\}$, $\{2\}$, $\{1,2\}$ and
-the starting state.  In effect the DFA in this example is not a
+$\{\}$ can never be reached from the starting state. But this might
-\emph{minimal} DFA (more about this in a minute). Such dead states can
+not always be as obvious as that. In effect the DFA in this example is
-be safely removed without changing the language that is recognised by
+not a \emph{minimal} DFA (more about this in a minute). Such dead
-the DFA. Another point is that in some cases, however, the subset
+states can be safely removed without changing the language that is
-construction produces a DFA that does \emph{not} contain any dead
+recognised by the DFA. Another point is that in some cases, however,
-states\ldots{}and further calculates a minimal DFA. Which in turn
+the subset construction produces a DFA that does \emph{not} contain
-means that in some cases the number of states can by going from NFAs
+any dead states\ldots{}this means it calculates a minimal DFA. Which
-to DFAs exponentially increase, namely by $2^n$ (which is the number
+in turn means that in some cases the number of states can by going
-of subsets you can form for $n$ states).  This blow up the number of
+from NFAs to DFAs exponentially increase, namely by $2^n$ (which is
-states in the DFA is again bad news for how quickly you can decide
+the number of subsets you can form for sets of $n$ states).  This blow
-whether a string is accepted by a DFA or not. So the caveat with DFAs
+up in the number of states in the DFA is again bad news for how
-is that they might make the task of finding the next state trival, but
+quickly you can decide whether a string is accepted by a DFA or
-might require $2^n$ times as many states as a NFA.\bigskip
+not. So the caveat with DFAs is that they might make the task of
+finding the next state trivial, but might require $2^n$ times as many
-Lastly, can we
+states than a NFA.\bigskip
+\noindent
+To conclude this section, how conveniently can we
+implement the subset construction with our versions of NFAs and
+DFAs? Very conveniently. The code is just:
 {\small\begin{lstlisting}[language=Scala]
 def subset[A, C](nfa: NFA[A, C]) : DFA[Set[A], C] = {
 DFA(nfa.starts,
 { case (qs, c) => nfa.nexts(qs, c) },
 _.exists(nfa.fins))
 }
 \end{lstlisting}}
+\noindent
+The interesting point in this code is that the state type of the
+calculated DFA is \texttt{Set[A]}. Think carefully about why this works out
+correctly.
+The DFA is then given by three components: the starting states, the
+transition function and the accepting-states function.  The starting
+states are a set in the given NFA, but a single state in the DFA.  The
+transition function, given the state \texttt{qs} and input \texttt{c},
+needs to produce the next state: this is the set of all NFA states
+that are reachable from each state in \texttt{qs}. The function
+\texttt{nexts} from the NFA class already calculates this for us. The
+accepting-states function for the DFA is true whenever at least one
+state in the subset is accepting in the NFA (that is, \texttt{nfa.fins} returns true for it).\medskip
+\noindent
+You might want to spend some quality time tinkering with this code and
+pondering it. Then you will probably notice it is actually
+silly. The whole point of translating the NFA into a DFA via the
+subset construction is to make the decision of whether a string is
+accepted or not faster. Given the code above, the generated DFA will
+be exactly as fast, or as slow, as the NFA we started with (actually
+it will even be a tiny bit slower). The reason is that we just re-use
+the \texttt{nexts} function from the NFA. This function implements the
+non-deterministic breadth-first search.  You might be thinking: That
+is cheating! \ldots{} Well, not quite as you will see later, but in
+terms of speed we still need to work a bit in order to get
+sometimes(!) a faster DFA. Let's do this next.
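 \noindent
 One possible way to get a DFA that no longer calls the NFA at every step,
 sketched here under assumptions, is to run the subset construction eagerly
 and only for the subsets that are reachable from the starting subset. The
 classes \texttt{NFATable} and \texttt{DFATable} and the function
 \texttt{subsetEager} are hypothetical stand-ins made up for this sketch;
 they are not the classes used elsewhere in this handout. The example NFA is
 reconstructed from the table entries described for the states $0$, $1$ and $2$.
 {\small\begin{lstlisting}[language=Scala]
 // Sketch only: hypothetical table-based automata, not the handout's classes
 case class NFATable[A, C](starts: Set[A],
                           delta: Map[(A, C), Set[A]],
                           fins: Set[A])
 case class DFATable[Q, C](start: Q, delta: Map[(Q, C), Q], fins: Set[Q])
 def subsetEager[A, C](nfa: NFATable[A, C],
                       alphabet: Set[C]): DFATable[Set[A], C] = {
   // one NFA step from every state in qs on character c
   def step(qs: Set[A], c: C): Set[A] =
     qs.flatMap(q => nfa.delta.getOrElse((q, c), Set.empty[A]))
   // worklist exploration of the subsets reachable from the start subset
   @annotation.tailrec
   def explore(todo: List[Set[A]], seen: Set[Set[A]],
               table: Map[(Set[A], C), Set[A]])
       : (Set[Set[A]], Map[(Set[A], C), Set[A]]) =
     todo match {
       case Nil => (seen, table)
       case qs :: rest =>
         val edges = alphabet.toList.map(c => (qs, c) -> step(qs, c))
         val fresh = edges.map(_._2).distinct.filterNot(seen.contains)
         explore(rest ++ fresh, seen ++ fresh, table ++ edges)
     }
   val (states, table) =
     explore(List(nfa.starts), Set(nfa.starts), Map.empty)
   // a subset is accepting if it contains an accepting NFA state
   DFATable(nfa.starts, table, states.filter(_.exists(nfa.fins)))
 }
 // the example NFA, as far as it can be read off the table in the text
 val nfa = NFATable[Int, Char](Set(0),
   Map((0, 'a') -> Set(0), (0, 'b') -> Set(0, 1),
       (1, 'a') -> Set(2), (1, 'b') -> Set(2)),
   Set(2))
 val dfa = subsetEager(nfa, Set('a', 'b'))
 \end{lstlisting}}
 \noindent
 In this sketch the dead states such as $\{1\}$ or $\{\}$ are never created
 for the example, because the exploration starts at $\{0\}$ and only follows
 transitions. The price is that the table is built up front, and in the worst
 case it can still contain $2^n$ subsets.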
 \subsection*{DFA Minimisation}
-As seen in the subset construction, the translation
+As seen in \eqref{subsetdfa}, the subset construction from NFA to a
-of a NFA to a DFA can result in a rather ``inefficient''
+DFA can result in a rather ``inefficient'' DFA. Meaning there are
-DFA. Meaning there are states that are not needed. A
+states that are not needed. There are two kinds of such unneeded
-DFA can be \emph{minimised} by the following algorithm:
+states: \emph{unreachable} states and \emph{nondistinguishable}
+states. The first kind of states can just be removed without affecting
+the language that can be recognised (after all they are
+unreachable). The second kind can also be recognised and thus a DFA
+can be \emph{minimised} by the following algorithm:
 \begin{enumerate}
 \item Take all pairs $(q, p)$ with $q \not= p$
 \item Mark all pairs that consist of an accepting and a non-accepting state
 \item For all unmarked pairs $(q, p)$ and all characters $c$
 are marked. If there is one, then also mark $(q, p)$.
 \item Repeat last step until no change.
 \item All unmarked pairs can be merged.
 \end{enumerate}
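 \noindent
 The steps of this algorithm can be sketched in Scala as follows. This is
 only an illustration under the assumption that the DFA is complete and is
 given by a set of states, an alphabet and a total transition function; the
 name \texttt{mergeable} and its parameters are made up for this sketch.
 Pairs are represented as two-element sets, so that $(q, p)$ and $(p, q)$
 count as the same pair.
 {\small\begin{lstlisting}[language=Scala]
 // Sketch only: hypothetical helper, assuming a complete DFA
 def mergeable[Q, C](states: Set[Q], alphabet: Set[C],
                     delta: (Q, C) => Q, fins: Set[Q]): Set[Set[Q]] = {
   // Step 1: all pairs of two different states
   val pairs = for (q <- states; p <- states if q != p) yield Set(q, p)
   // Step 2: mark pairs with exactly one accepting state
   val init = pairs.filter(pr => pr.count(fins) == 1)
   // Steps 3 and 4: mark a pair if some character leads to a marked
   // pair; repeat until nothing changes
   @annotation.tailrec
   def loop(marked: Set[Set[Q]]): Set[Set[Q]] = {
     val next = marked ++ pairs.filter(pr =>
       alphabet.exists(c => marked.contains(pr.map(q => delta(q, c)))))
     if (next == marked) marked else loop(next)
   }
   // Step 5: the pairs that stay unmarked can be merged
   pairs -- loop(init)
 }
 // tiny made-up DFA over 'a': 0 -> 1 -> 2 -> 2, with 1 and 2 accepting
 val toMerge = mergeable(Set(0, 1, 2), Set('a'),
   (q: Int, c: Char) => if (q == 0) 1 else 2, Set(1, 2))
 \end{lstlisting}}
 \noindent
 In the made-up example the states $1$ and $2$ are both accepting and both
 move to $2$, so their pair is never marked and is the only one that can be
 merged.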
-\noindent To illustrate this algorithm, consider the following
+\noindent Unfortunately, once we throw away all unreachable states in
-DFA.
+\eqref{subsetdfa}, all remaining states are needed.  In order to
+illustrate the minimisation algorithm, consider the following DFA.
 \begin{center}
 \begin{tikzpicture}[>=stealth',very thick,auto,
 every state/.style={minimum size=0pt,
 inner sep=2pt,draw=blue!50,very thick,
 \noindent where the lower row is filled with stars, because in
 the corresponding pairs there is always one state that is
 accepting ($Q_4$) and a state that is non-accepting (the other
 states).
-Now in Step 3 we need to fill in more stars according whether
+In Step 3 we need to fill in more stars according to whether
 one of the next-state pairs is marked. We have to do this
 for every unmarked field until there is no change anymore.
 This gives the triangle
 \begin{center}
 \end{center}
 \subsection*{Brzozowski's Method}
-As said before, we can also go into the other direction---from
+I know this is already a long, long rant, but after all it is a topic
-DFAs to regular expressions. Brzozowski's method calculates
+that has been researched for more than 60 years. If you reflect on
-a regular expression using familiar transformations for
+what you have read so far, the story is that you can take a regular
-solving equational systems. Consider the DFA:
+expression, translate it via the Thompson Construction into an
+$\epsilon$NFA, then translate it into a NFA by removing all
+$\epsilon$-transitions, and then via the subset construction obtain a
+DFA. In all steps we made sure the language, or which strings can be
+recognised, stays the same. After the last section, we can even
+minimise the DFA. But again we made sure the same language is
+recognised. You might be wondering: Can we go into the other
+direction?  Can we go from a DFA and obtain a regular expression that
+can recognise the same language as the DFA?\medskip
+\noindent
+The answer is yes. Again there are several methods for calculating a
+regular expression for a DFA. I will show you Brzozowski's method
+because it calculates a regular expression using quite familiar
+transformations for solving equational systems. Consider the DFA:
 \begin{center}
 \begin{tikzpicture}[scale=1.5,>=stealth',very thick,auto,
 every state/.style={minimum size=0pt,
 inner sep=2pt,draw=blue!50,very thick,
 \end{eqnarray}
 \noindent Finally, we only need to ``add'' up the equations
 which correspond to a terminal state. In our running example,
 this is just $Q_2$. Consequently, a regular expression
-that recognises the same language as the automaton is
+that recognises the same language as the DFA is
 \[
 (b + a\,b + a\,a\,(a^*)\,b)^*\,a\,a\,(a)^*
 \]
-\noindent You can somewhat crosscheck your solution
+\noindent You can somewhat crosscheck your solution by taking a string
-by taking a string the regular expression can match and
+the regular expression can match and see whether it can be matched
-and see whether it can be matched by the automaton.
+by the DFA.  One string for example is $aaa$ and \emph{voila} this
-One string for example is $aaa$ and \emph{voila} this
 string is also matched by the automaton.
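 \noindent
 Such a crosscheck can also be done mechanically for many strings at once.
 The following sketch is an assumption-laden illustration and not part of
 the handout's code: it transliterates the regular expression above into
 the syntax of \texttt{java.util.regex} and compares it against a property
 obtained by reading the expression by hand, namely that it seems to describe
 exactly the strings over $a$ and $b$ that end in $aa$. The names
 \texttt{rexp}, \texttt{strings} and \texttt{agrees} are made up.
 {\small\begin{lstlisting}[language=Scala]
 // Sketch only: + becomes |, sequencing is written by juxtaposition
 val rexp = "(b|ab|aaa*b)*aaa*"
 // all strings over the characters a and b up to length n
 def strings(n: Int): List[String] =
   if (n == 0) List("") else {
     val shorter = strings(n - 1)
     val longest = shorter.filter(_.length == n - 1)
     shorter ++ longest.flatMap(s => List(s + "a", s + "b"))
   }
 // String.matches tests whether the whole string is matched
 val example = "aaa".matches(rexp)
 val agrees =
   strings(8).forall(s => s.matches(rexp) == s.endsWith("aa"))
 \end{lstlisting}}
 \noindent
 Testing all short strings is of course no proof, but a disagreement would
 quickly reveal a mistake in solving the equations.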
-We should prove that Brzozowski's method really produces
+We should prove that Brzozowski's method really produces an equivalent
-an equivalent  regular expression for the automaton. But
+regular expression. But for the purposes of this module, we omit
-for the purposes of this module, we omit this.
+this. I guess you are relieved.
 \subsection*{Regular Languages}
 Given the constructions in the previous sections we obtain
 Although we did not prove this fact. Similarly by going from
 DFAs to regular expressions, we can make sure for every DFA
 there exists a regular expression that can recognise the same
 language. Again we did not prove this fact.
-The interesting conclusion is that automata and regular
+The fundamental conclusion we can draw is that automata and regular
 expressions can recognise the same set of languages:
 \begin{quote} A language is \emph{regular} iff there exists a
 regular expression that recognises all its strings.
 \end{quote}
 \begin{quote} A language is \emph{regular} iff there exists an
 automaton that recognises all its strings.
 \end{quote}
-\noindent So for deciding whether a string is recognised by a
+\noindent Note that this is not a statement about a particular language
-regular expression, we could use our algorithm based on
+(that is a particular set of strings), but about a large class of
-derivatives or NFAs or DFAs. But let us quickly look at what
+languages, namely the regular ones.
-the differences mean in computational terms. Translating a
-regular expression into a NFA gives us an automaton that has
+As a consequence for deciding whether a string is recognised by a
-$O(n)$ states---that means the size of the NFA grows linearly
+regular expression, we could use our algorithm based on derivatives or
-with the size of the regular expression. The problem with NFAs
+NFAs or DFAs. But let us quickly look at what the differences mean in
-is that the problem of deciding whether a string is accepted
+computational terms. Translating a regular expression into a NFA gives
-or not is computationally not cheap. Remember with NFAs we
+us an automaton that has $O(n)$ states---that means the size of the
-have potentially many next states even for the same input and
+NFA grows linearly with the size of the regular expression. The
-also have the silent $\epsilon$-transitions. If we want to
+problem with NFAs is that the problem of deciding whether a string is
-find a path from the starting state of a NFA to an accepting
+accepted or not is computationally not cheap. Remember with NFAs we
-state, we need to consider all possibilities. In Ruby and
+have potentially many next states even for the same input and also
-Python this is done by a depth-first search, which in turn
+have the silent $\epsilon$-transitions. If we want to find a path from
-means that if a ``wrong'' choice is made, the algorithm has to
+the starting state of a NFA to an accepting state, we need to consider
-backtrack and thus explore all potential candidates. This is
+all possibilities. In Ruby, Python and Java this is done by a
-exactly the reason why Ruby and Python are so slow for evil
+depth-first search, which in turn means that if a ``wrong'' choice is
-regular expressions. An alternative to the potentially slow
+made, the algorithm has to backtrack and thus explore all potential
-depth-first search is to explore the search space in a
+candidates. This is exactly the reason why Ruby, Python and Java are
-breadth-first fashion, but this might incur a big memory
+so slow for evil regular expressions. An alternative to the
-penalty.
+potentially slow depth-first search is to explore the search space in
+a breadth-first fashion, but this might incur a big memory penalty.
-To avoid the problems with NFAs, we can translate them
-into DFAs. With DFAs the problem of deciding whether a
+To avoid the problems with NFAs, we can translate them into DFAs. With
-string is recognised or not is much simpler, because in
+DFAs the problem of deciding whether a string is recognised or not is
-each state it is completely determined what the next
+much simpler, because in each state it is completely determined what
-state will be for a given input. So no search is needed.
+the next state will be for a given input. So no search is needed.  The
-The problem with this is that the translation to DFAs
+problem with this is that the translation to DFAs can explode
-can explode exponentially the number of states. Therefore when
+exponentially the number of states. Therefore when this route is
-this route is taken, we definitely need to minimise the
+taken, we definitely need to minimise the resulting DFAs in order to
-resulting DFAs in order to have an acceptable memory
+have an acceptable memory and runtime behaviour. But remember the
-and runtime behaviour. But remember the subset construction
+subset construction in the worst case explodes the number of states by
-in the worst case explodes the number of states by $2^n$.
+$2^n$.  Effectively also the translation to DFAs can incur a big
-Effectively also the translation to DFAs can incur a big
 runtime penalty.
-But this does not mean that everything is bad with automata.
+But this does not mean that everything is bad with automata.  Recall
-Recall the problem of finding a regular expressions for the
+the problem of finding a regular expression for the language that is
-language that is \emph{not} recognised by a regular
+\emph{not} recognised by a regular expression. In our implementation
-expression. In our implementation we added explicitly such a
+we explicitly added such regular expressions because they are useful
-regular expressions because they are useful for recognising
+for recognising comments. But in principle we did not need to. The
-comments. But in principle we did not need to. The argument
+argument for this is as follows: take a regular expression, translate
-for this is as follows: take a regular expression, translate
 it into a NFA and then a DFA that both recognise the same
-language. Once you have the DFA it is very easy to construct
+language. Once you have the DFA it is very easy to construct the
-the automaton for the language not recognised by a DFA. If
+automaton for the language not recognised by a DFA. If the DFA is
-the DFA is completed (this is important!), then you just need
+completed (this is important!), then you just need to exchange the
-to exchange the accepting and non-accepting states. You can
+accepting and non-accepting states. You can then translate this DFA
-then translate this DFA back into a regular expression and
+back into a regular expression and that will be the regular expression
-that will be the regular expression that can match all strings
+that can match all strings the original regular expression could
-the original regular expression could \emph{not} match.
+\emph{not} match.
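 \noindent
 The step of exchanging accepting and non-accepting states is short enough
 to sketch in code. The class \texttt{TotalDFA} below is a hypothetical
 stand-in for a completed DFA with a total transition function; it is an
 assumption of this sketch and not the DFA class used in this handout.
 {\small\begin{lstlisting}[language=Scala]
 // Sketch only: a completed DFA has a next state for every input
 case class TotalDFA[A, C](start: A, delta: (A, C) => A,
                           fins: A => Boolean) {
   // run the DFA on the input and test the final state
   def accepts(cs: List[C]): Boolean =
     fins(cs.foldLeft(start)((q, c) => delta(q, c)))
 }
 // same start state and transitions, accepting states exchanged
 def complement[A, C](dfa: TotalDFA[A, C]): TotalDFA[A, C] =
   TotalDFA(dfa.start, dfa.delta, (q: A) => !dfa.fins(q))
 // made-up example: strings with an even number of a's
 val even = TotalDFA[Int, Char](0,
   (q, c) => if (c == 'a') 1 - q else q, _ == 0)
 val odd = complement(even)
 \end{lstlisting}}
 \noindent
 If the transition function were only partial, a string on which the
 original DFA gets stuck would be rejected by both automata, which is why
 the completion matters.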
-It is also interesting that not all languages are regular. The
+It is also interesting that not all languages are regular. The most
-most well-known example of a language that is not regular
+well-known example of a language that is not regular consists of all
-consists of all the strings of the form
+the strings of the form
 \[a^n\,b^n\]
-\noindent meaning strings that have the same number of $a$s
+\noindent meaning strings that have the same number of $a$s and
-and $b$s. You can try, but you cannot find a regular
+$b$s. You can try, but you cannot find a regular expression for this
-expression for this language and also not an automaton. One
+language and also not an automaton. One can actually prove that there
-can actually prove that there is no regular expression nor
+is no regular expression nor automaton for this language, but again
-automaton for this language, but again that would lead us too
+that would lead us too far afield for what we want to do in this
-far afield for what we want to do in this module.
+module.
 %\section*{Further Reading}
 %Compare what a ``human expert'' would create as an automaton for the
 %regular expression $a\cdot (b + c)^*$ and what the Thompson

changeset 491	d5776c6018f0
parent 490	4fee50f38305
child 492	39b7ff2cf1bc