afl-material: comparison handouts/ho03.tex

equal deleted inserted replaced

-:c0bdd4ad69ca
+:d94fdbef1a4f
 \usepackage{../langs}
 \usepackage{../graphics}
 \begin{document}
-\fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017}
+\fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017, 2020}
 \section*{Handout 3 (Finite Automata)}
 Every formal language and compiler course I know of bombards you first
 \noindent I let you think about a definition that describes the set of
 all strings accepted by a deterministic finite automaton.
 \begin{figure}[p]
 \small
-\lstinputlisting[numbers=left]{../progs/display/dfa.scala}
+\lstinputlisting[numbers=left,lastline=43]{../progs/automata/dfa.sc}
 \caption{An implementation of DFAs in Scala using partial functions.
 Note some subtleties: \texttt{deltas} implements the delta-hat
 construction by lifting the (partial) transition  function to lists
 of characters. Since \texttt{delta} is given as a partial function,
 it can obviously go ``wrong'' in which case the \texttt{Try} in
 \texttt{accepts} catches the error and returns \texttt{false}---that
 means the string is not accepted.  The example \texttt{delta} in
-Line 28--38 implements the DFA example shown earlier in the
+Line 22--43 implements the DFA example shown earlier in the
 handout.\label{dfa}}
 \end{figure}
 My take on an implementation of DFAs in Scala is given in
 Figure~\ref{dfa}. As you can see, there are many features of the
 The DFA-class has also an argument for specifying final states. In the
 implementation it is not a set of states, as in the mathematical
 definition, but a function from states to booleans (this function is
 supposed to return true whenever a state is final; false
 otherwise). While this boolean function is different from the sets of
-states, Scala allows us to use sets for such functions (see Line 40 where
+states, Scala allows us to use sets for such functions (see Line 41 where
 the DFA is initialised). Again it will become clear later on why I use
 functions for final states, rather than sets.
 The most important point in the implementation is that I use Scala's
 partial functions for representing the transitions; alternatives would
 have been \texttt{Maps} or even \texttt{Lists}. One of the main
 advantages of using partial functions is that transitions can be quite
-nicely defined by a series of \texttt{case} statements (see Lines 28
+nicely defined by a series of \texttt{case} statements (see Lines 29
--- 38 for an example). If you need to represent an automaton with a
+-- 39 for an example). If you need to represent an automaton with a
 sink state (catch-all-state), you can use Scala's pattern matching and
 write something like
 {\small\begin{lstlisting}[language=Scala]
 abstract class State
 NFAs a set of states; \texttt{fins} is again a function from states to
 booleans. The \texttt{next} function calculates the set of next states
 reachable from a single state \texttt{q} by a character~\texttt{c}. In
 case there is no such state---the partial transition function is
 undefined---the empty set is returned (see function
-\texttt{applyOrElse} in Lines 9 and 10). The function \texttt{nexts}
+\texttt{applyOrElse} in Lines 11 and 12). The function \texttt{nexts}
 just lifts this function to sets of states.
 \begin{figure}[p]
 \small
-\lstinputlisting[numbers=left]{../progs/display/nfa.scala}
+\lstinputlisting[numbers=left,lastline=43]{../progs/automata/nfa.sc}
 \caption{A Scala implementation of NFAs using partial functions.
 Notice that the function \texttt{accepts} implements the
 acceptance of a string in a breadth-first search fashion. This can be a costly
 way of deciding whether a string is accepted or not in applications that need to handle
 large NFAs and large inputs.\label{nfa}}
 \end{figure}
 Look very careful at the \texttt{accepts} and \texttt{deltas}
 functions in NFAs and remember that when accepting a string by a NFA
 we might have to explore all possible transitions (recall which state
-to go to is not unique anymore with NFAs\ldots{}we need to explore
+to go to is not unique any more with NFAs\ldots{}we need to explore
 potentially all next states). The implementation achieves this
 exploration through a \emph{breadth-first search}. This is fine for
 small NFAs, but can lead to real memory problems when the NFAs are
 bigger and larger strings need to be processed. As result, some
 regular expression matching engines resort to a \emph{depth-first
 \subsection*{Epsilon NFAs}
 In order to get an idea what calculations are performed by Java \&
 friends, we need a method for transforming a regular expression into
-an automaton. This automaton should accept exactly those strings that
+a corresponding automaton. This automaton should accept exactly those strings that
 are accepted by the regular expression.  The simplest and most
-well-known method for this is called \emph{Thompson Construction},
+well-known method for this is called the \emph{Thompson Construction},
 after the Turing Award winner Ken Thompson. This method is by
 recursion over regular expressions and depends on the non-determinism
 in NFAs described in the previous section. You will see shortly why
 this construction works well with NFAs, but is not so straightforward
 with DFAs.
 unfortunately.  If we were to follow the many textbooks on the
 subject, we would now start with defining what $\epsilon$NFAs
 are---that would require extending the transition relation of
 NFAs. Next, we would show that the $\epsilon$NFAs are equivalent to
 NFAs and so on. Once we have done all this on paper, we would need to
-implement $\epsilon$NFAs. Lets try to take a shortcut instead. We are
+implement $\epsilon$NFAs. Let's try to take a shortcut instead. We are
 not really interested in $\epsilon$NFAs; they are only a convenient
 tool for translating regular expressions into automata. So we are not
 going to implementing them explicitly, but translate them immediately
 into NFAs (in a sense $\epsilon$NFAs are just a convenient API for
-lazy people ;o).  How does the translation work? Well we have to find
+lazy people ;o).  How does this translation work? Well we have to find
 all transitions of the form
 \[
 q\stackrel{\epsilon}{\longrightarrow}\ldots\stackrel{\epsilon}{\longrightarrow}
 \;\stackrel{a}{\longrightarrow}\;
 \stackrel{\epsilon}{\longrightarrow}\ldots\stackrel{\epsilon}{\longrightarrow} q'
 \]
-\noindent where somewhere in the ``middle'' is an $a$-transition. We
+\noindent where somewhere in the ``middle'' is an $a$-transition (for
-replace them with $q \stackrel{a}{\longrightarrow} q'$. Doing this to
+a character, say, $a$). We replace them with
-the $\epsilon$NFA on the right-hand side above gives the NFA
+$q \stackrel{a}{\longrightarrow} q'$. Doing this to the $\epsilon$NFA
+on the right-hand side above gives the NFA
 \begin{center}
 \begin{tikzpicture}[>=stealth',very thick,
 every state/.style={minimum size=0pt,draw=blue!50,very thick,fill=blue!20},]
 \node[state,initial]  (r_1)  {$R_1$};
 and verify that I did not forget any transition.
 So in what follows, whenever we are given an $\epsilon$NFA we will
 replace it by an equivalent NFA. The Scala code for this translation
 is given in Figure~\ref{enfa}. The main workhorse in this code is a
-function that calculates a fixpoint of function (Lines 5--10). This
+function that calculates a fixpoint of function (Lines 6--12). This
 function is used for ``discovering'' which states are reachable by
 $\epsilon$-transitions. Once no new state is discovered, a fixpoint is
 reached. This is used for example when calculating the starting states
-of an equivalent NFA (see Line 36): we start with all starting states
+of an equivalent NFA (see Line 28): we start with all starting states
 of the $\epsilon$NFA and then look for all additional states that can
 be reached by $\epsilon$-transitions. We keep on doing this until no
 new state can be reached. This is what the $\epsilon$-closure, named
 in the code \texttt{ecl}, calculates. Similarly, an accepting state of
 the NFA is when we can reach an accepting state of the $\epsilon$NFA
 by $\epsilon$-transitions.
 \begin{figure}[p]
 \small
-\lstinputlisting[numbers=left]{../progs/display/enfa.scala}
+\lstinputlisting[numbers=left,lastline=43]{../progs/automata/enfa.sc}
 \caption{A Scala function that translates $\epsilon$NFA into NFAs. The
 transition function of $\epsilon$NFA takes as input an \texttt{Option[C]}.
 \texttt{None} stands for an $\epsilon$-transition; \texttt{Some(c)}
 for a ``proper'' transition consuming a character. The functions in
-Lines 18--26 calculate
+Lines 19--24 calculate
 all states reachable by one or more $\epsilon$-transition for a given
-set of states. The NFA is constructed in Lines 36--38.
+set of states. The NFA is constructed in Lines 30--34.
-Note the interesting commands in Lines 5 and 6: their purpose is
+Note the interesting commands in Lines 7 and 8: their purpose is
 to ensure that \texttt{fixpT} is the tail-recursive version of
 the fixpoint construction; otherwise we would quickly get a
 stack-overflow exception, even on small examples, due to limitations
 of the JVM.
 \label{enfa}}
 increases this counter whenever a new state is created.\footnote{You might
 have to read up what \emph{companion objects} do in Scala.}
 \begin{figure}[p]
 \small
-\lstinputlisting[numbers=left]{../progs/display/thompson1.scala}
+\lstinputlisting[numbers=left,linerange={1-20}]{../progs/automata/thompson.sc}
-\caption{The first part of the Thompson Construction. Lines 7--16
+\hspace{5mm}\texttt{\dots}
+\lstinputlisting[numbers=left,linerange={28-45},firstnumber=28]{../progs/automata/thompson.sc}
+\caption{The first part of the Thompson Construction. Lines 10--19
 implement a way of how to create new states that are all
 distinct by virtue of a counter. This counter is
 increased in the companion object of \texttt{TState}
-whenever a new state is created. The code in Lines 24--40
+whenever a new state is created. The code in Lines 38--45
 constructs NFAs for the simple regular expressions $\ZERO$, $\ONE$ and $c$.
 Compare this code with the pictures given in \eqref{simplecases} on
 Page~\pageref{simplecases}.
 \label{thompson1}}
 \end{figure}
 \begin{figure}[p]
 \small
-\lstinputlisting[numbers=left]{../progs/display/thompson2.scala}
+\lstinputlisting[numbers=left,firstline=48,firstnumber=48,lastline=85]{../progs/automata/thompson.sc}
 \caption{The second part of the Thompson Construction implementing
 the composition of NFAs according to $\cdot$, $+$ and ${}^*$.
-The implicit class about rich partial functions
+The implicit class (Lines 48--54) about rich partial functions
 implements the infix operation \texttt{+++} which
-combines an $\epsilon$NFA transition with a NFA transition
+combines an $\epsilon$NFA transition with an NFA transition
 (both are given as partial functions---but with different type!).\label{thompson2}}
 \end{figure}
 The case for the sequence regular expression $r_1 \cdot r_2$ is a bit more
 complicated: Say, we are given by recursion two NFAs representing the regular
 \noindent The idea behind this construction is that the start of any
 string is first recognised by the first NFA, then we silently change
 to the second NFA; the ending of the string is recognised by the
 second NFA...just like matching of a string by the regular expression
 $r_1\cdot r_2$. The Scala code for this construction is given in
-Figure~\ref{thompson2} in Lines 16--23. The starting states of the
+Figure~\ref{thompson2} in Lines 57--65. The starting states of the
 $\epsilon$NFA are the starting states of the first NFA (corresponding
 to $r_1$); the accepting function is the accepting function of the
 second NFA (corresponding to $r_2$). The new transition function is
 all the ``old'' transitions plus the $\epsilon$-transitions connecting
 the accepting states of the first NFA to the starting states of the
-first NFA (Lines 18 and 19). The $\epsilon$NFA is then immediately
+second NFA (Lines 59 and 60). The $\epsilon$NFA is then immediately
 translated in a NFA.
 The case for the alternative regular expression $r_1 + r_2$ is
 slightly different: We are given by recursion two NFAs representing
 $r_1$ and $r_2$ respectively. Each NFA has some starting states and
-some accepting states. We obtain a NFA for the regular expression $r_1
+some accepting states. We obtain a NFA for the regular expression
-+ r_2$ by composing the transition functions (this crucially depends
+$r_1 + r_2$ by composing the transition functions (this crucially
-on knowing that the states of each component NFA are distinct---recall
+depends on knowing that the states of each component NFA are
-we implemented for this to hold some bespoke code for states). We also
+distinct---recall we implemented for this to hold by some bespoke code
-need to combine the starting states and accepting functions
+for \texttt{TState}s). We also need to combine the starting states and
-appropriately.
+accepting functions appropriately.
 \begin{center}
 \begin{tabular}[t]{ccc}
 \begin{tikzpicture}[node distance=3mm,
 >=stealth',very thick,
 \end{tikzpicture}
 \end{tabular}
 \end{center}
 \noindent The code for this construction is in Figure~\ref{thompson2}
-in Lines 25--33.
+in Lines 67--75.
 Finally for the $*$-case we have a NFA for $r$ and connect its
 accepting states to a new starting state via
 $\epsilon$-transitions. This new starting state is also an accepting
 state, because $r^*$ can recognise the empty string.
 \end{tikzpicture}
 \end{tabular}
 \end{center}
 \noindent
-The corresponding code is in Figure~\ref{thompson2} in Lines 35--43)
+The corresponding code is in Figure~\ref{thompson2} in Lines 77--85)
-To sum up, you can see in the sequence and star cases the need of
+To sum up, you can see in the sequence and star cases the need for
-having silent $\epsilon$-transitions. Similarly the alternative case
+silent $\epsilon$-transitions. Otherwise this construction just
-shows the need of the NFA-nondeterminism. It seems awkward to form the
+becomes awkward. Similarly the alternative case shows the need of the
-`alternative' composition of two DFAs, because DFA do not allow
+NFA-nondeterminism. It looks non-obvious to form the `alternative'
-several starting and successor states. All these constructions can now
+composition of two DFAs, because DFA do not allow several starting and
-be put together in the following recursive function:
+successor states. All these constructions can now be put together in
+the following recursive function:
 {\small\begin{lstlisting}[language=Scala]
 def thompson(r: Rexp) : NFAt = r match {
 case ZERO => NFA_ZERO()
 \noindent
 It calculates a NFA from a regular expressions. At last we can run
 NFAs for the our evil regular expression examples.  The graph on the
 left shows that when translating a regular expression such as
-$a^{\{n\}}$ into a NFA, the size can blow up and then even the
+$a^{?\{n\}} \cdot a^{\{n\}}$ into a NFA, the size can blow up and then
-relative fast (on small examples) breadth-first search can be
+even the relative fast (on small examples) breadth-first search can be
-slow. The graph on the right shows that with `evil' regular
+slow\ldots the red line maxes out at about 15 $n$s.
-expressions the depth-first search can be abysmally slow. Even if the
-graphs not completely overlap with the curves of Python, Ruby and
-Java, they are similar enough. OK\ldots now you know why regular
+The graph on the right shows that with `evil' regular expressions also
-expression matchers in those languages are so slow.
+the depth-first search can be abysmally slow. Even if the graphs not
+completely overlap with the curves of Python, Ruby and Java, they are
+similar enough.
 \begin{center}
 \begin{tabular}{@{\hspace{-1mm}}c@{\hspace{1mm}}c@{}}
 \begin{tikzpicture}
 ytick={0,5,...,30},
 scaled ticks=false,
 axis lines=left,
 width=5.5cm,
 height=4cm,
-legend entries={Python, Java, depth-first NFA},
+legend entries={Python, Java 8, depth-first NFA},
 legend style={at={(0.5,-0.25)},anchor=north,font=\small},
 legend cell align=left]
 \addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
 \addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
 % depth-first search in NFAs
 \end{axis}
 \end{tikzpicture}
 \end{tabular}
 \end{center}
+\noindent
+OK\ldots now you know why regular expression matchers in those
+languages are sometimes so slow. A bit surprising, don't you agree?
 \subsection*{Subset Construction}
 Of course, some developers of regular expression matchers are aware of
 these problems with sluggish NFAs and try to address them. One common
 a single state. So with DFAs we do not have to search at all.  What is
 perhaps interesting is the fact that for every NFA we can find a DFA
 that also recognises the same language. This might sound a bit
 paradoxical: NFA $\rightarrow$ decision of acceptance hard; DFA
 $\rightarrow$ decision easy. But this \emph{is} true\ldots but of course
-there is always a caveat---nothing ever is for free in life.
+there is always a caveat---nothing ever is for free in life. Let's see what this
+actually means.
 There are actually a number of methods for transforming a NFA into
 an equivalent DFA, but the most famous one is the \emph{subset
 construction}. Consider the following NFA where the states are
 labelled with $0$, $1$ and $2$.
 accepting ($Q_4$) and a state that is non-accepting (the other
 states).
 In Step 3 we need to fill in more stars according whether
 one of the next-state pairs are marked. We have to do this
-for every unmarked field until there is no change anymore.
+for every unmarked field until there is no change any more.
 This gives the triangle
 \begin{center}
 \begin{tikzpicture}[scale=0.6,line width=0.8mm]
 \draw (0,0) -- (4,0);
 \path[->] (Q_4) edge [loop above] node  {$a, b$} ();
 \end{tikzpicture}
 \end{center}
+\noindent This minimised DFA is certainly fast when it comes deciding
+whether a string is accepted or not. But this is not universally the
+case.  Suppose you count the nodes in a regular expression (when
+represented as tree).  If you look carefully at the Thompson
+Construction you can see that the constructed NFA has states that grow
+linearly in terms of the size of the regular expression. This is good,
+but as we have seen earlier deciding whether a string is matched by an
+NFA is hard. Translating an NFA into a DFA means deciding whether a
+string is matched by a DFA is easy, but the number of states can grow
+exponentially, even after minimisation. Say a NFA has $n$ states, then
+in the worst case the corresponding minimal DFA that can match the
+same language as the NFA might contain $2^n$ of states. Unfortunately
+in many interesting cases this worst case bound is the dominant
+factor.
 By the way, we are not bothering with implementing the above
 minimisation algorithm: while up to now all the transformations used
 some clever composition of functions, the minimisation algorithm
 cannot be implemented by just composing some functions. For this we
 would require a more concrete representation of the transition
 function (like maps). If we did this, however, then many advantages of
 the functions would be thrown away. So the compromise is to not being
-able to minimise (easily) our DFAs.
+able to minimise (easily) our DFAs. We want to use regular expressions
+directly anyway.
 \subsection*{Brzozowski's Method}
 I know this handout is already a long, long rant: but after all it is
 a topic that has been researched for more than 60 years. If you

changeset 753	d94fdbef1a4f
parent 698	7c7854feccb5
child 764	6718ef6143b8