--- a/handouts/ho03.tex Tue Sep 01 16:00:37 2020 +0100
+++ b/handouts/ho03.tex Wed Sep 02 23:34:19 2020 +0100
@@ -6,7 +6,7 @@
\begin{document}
-\fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017}
+\fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017, 2020}
\section*{Handout 3 (Finite Automata)}
@@ -131,7 +131,7 @@
\begin{figure}[p]
\small
-\lstinputlisting[numbers=left]{../progs/display/dfa.scala}
+\lstinputlisting[numbers=left,lastline=43]{../progs/automata/dfa.sc}
\caption{An implementation of DFAs in Scala using partial functions.
Note some subtleties: \texttt{deltas} implements the delta-hat
construction by lifting the (partial) transition function to lists
@@ -139,7 +139,7 @@
it can obviously go ``wrong'' in which case the \texttt{Try} in
\texttt{accepts} catches the error and returns \texttt{false}---that
means the string is not accepted. The example \texttt{delta} in
- Line 28--38 implements the DFA example shown earlier in the
+ Line 22--43 implements the DFA example shown earlier in the
handout.\label{dfa}}
\end{figure}
@@ -161,7 +161,7 @@
definition, but a function from states to booleans (this function is
supposed to return true whenever a state is final; false
otherwise). While this boolean function is different from the sets of
-states, Scala allows us to use sets for such functions (see Line 40 where
+states, Scala allows us to use sets for such functions (see Line 41 where
the DFA is initialised). Again it will become clear later on why I use
functions for final states, rather than sets.
@@ -169,8 +169,8 @@
partial functions for representing the transitions; alternatives would
have been \texttt{Maps} or even \texttt{Lists}. One of the main
advantages of using partial functions is that transitions can be quite
-nicely defined by a series of \texttt{case} statements (see Lines 28
--- 38 for an example). If you need to represent an automaton with a
+nicely defined by a series of \texttt{case} statements (see Lines 29
+-- 39 for an example). If you need to represent an automaton with a
sink state (catch-all-state), you can use Scala's pattern matching and
write something like
@@ -285,12 +285,12 @@
reachable from a single state \texttt{q} by a character~\texttt{c}. In
case there is no such state---the partial transition function is
undefined---the empty set is returned (see function
-\texttt{applyOrElse} in Lines 9 and 10). The function \texttt{nexts}
+\texttt{applyOrElse} in Lines 11 and 12). The function \texttt{nexts}
just lifts this function to sets of states.
\begin{figure}[p]
\small
-\lstinputlisting[numbers=left]{../progs/display/nfa.scala}
+\lstinputlisting[numbers=left,lastline=43]{../progs/automata/nfa.sc}
\caption{A Scala implementation of NFAs using partial functions.
Notice that the function \texttt{accepts} implements the
acceptance of a string in a breadth-first search fashion. This can be a costly
@@ -301,7 +301,7 @@
Look very careful at the \texttt{accepts} and \texttt{deltas}
functions in NFAs and remember that when accepting a string by a NFA
we might have to explore all possible transitions (recall which state
-to go to is not unique anymore with NFAs\ldots{}we need to explore
+to go to is not unique any more with NFAs\ldots{}we need to explore
potentially all next states). The implementation achieves this
exploration through a \emph{breadth-first search}. This is fine for
small NFAs, but can lead to real memory problems when the NFAs are
@@ -336,9 +336,9 @@
In order to get an idea what calculations are performed by Java \&
friends, we need a method for transforming a regular expression into
-an automaton. This automaton should accept exactly those strings that
+a corresponding automaton. This automaton should accept exactly those strings that
are accepted by the regular expression. The simplest and most
-well-known method for this is called \emph{Thompson Construction},
+well-known method for this is called the \emph{Thompson Construction},
after the Turing Award winner Ken Thompson. This method is by
recursion over regular expressions and depends on the non-determinism
in NFAs described in the previous section. You will see shortly why
@@ -409,12 +409,12 @@
are---that would require extending the transition relation of
NFAs. Next, we would show that the $\epsilon$NFAs are equivalent to
NFAs and so on. Once we have done all this on paper, we would need to
-implement $\epsilon$NFAs. Lets try to take a shortcut instead. We are
+implement $\epsilon$NFAs. Let's try to take a shortcut instead. We are
not really interested in $\epsilon$NFAs; they are only a convenient
tool for translating regular expressions into automata. So we are not
going to implementing them explicitly, but translate them immediately
into NFAs (in a sense $\epsilon$NFAs are just a convenient API for
-lazy people ;o). How does the translation work? Well we have to find
+lazy people ;o). How does this translation work? Well we have to find
all transitions of the form
\[
@@ -423,9 +423,10 @@
\stackrel{\epsilon}{\longrightarrow}\ldots\stackrel{\epsilon}{\longrightarrow} q'
\]
-\noindent where somewhere in the ``middle'' is an $a$-transition. We
-replace them with $q \stackrel{a}{\longrightarrow} q'$. Doing this to
-the $\epsilon$NFA on the right-hand side above gives the NFA
+\noindent where somewhere in the ``middle'' is an $a$-transition (for
+a character, say, $a$). We replace them with
+$q \stackrel{a}{\longrightarrow} q'$. Doing this to the $\epsilon$NFA
+on the right-hand side above gives the NFA
\begin{center}
\begin{tikzpicture}[>=stealth',very thick,
@@ -449,11 +450,11 @@
So in what follows, whenever we are given an $\epsilon$NFA we will
replace it by an equivalent NFA. The Scala code for this translation
is given in Figure~\ref{enfa}. The main workhorse in this code is a
-function that calculates a fixpoint of function (Lines 5--10). This
+function that calculates a fixpoint of function (Lines 6--12). This
function is used for ``discovering'' which states are reachable by
$\epsilon$-transitions. Once no new state is discovered, a fixpoint is
reached. This is used for example when calculating the starting states
-of an equivalent NFA (see Line 36): we start with all starting states
+of an equivalent NFA (see Line 28): we start with all starting states
of the $\epsilon$NFA and then look for all additional states that can
be reached by $\epsilon$-transitions. We keep on doing this until no
new state can be reached. This is what the $\epsilon$-closure, named
@@ -464,16 +465,16 @@
\begin{figure}[p]
\small
-\lstinputlisting[numbers=left]{../progs/display/enfa.scala}
+\lstinputlisting[numbers=left,lastline=43]{../progs/automata/enfa.sc}
\caption{A Scala function that translates $\epsilon$NFA into NFAs. The
transition function of $\epsilon$NFA takes as input an \texttt{Option[C]}.
\texttt{None} stands for an $\epsilon$-transition; \texttt{Some(c)}
for a ``proper'' transition consuming a character. The functions in
- Lines 18--26 calculate
+ Lines 19--24 calculate
all states reachable by one or more $\epsilon$-transition for a given
- set of states. The NFA is constructed in Lines 36--38.
- Note the interesting commands in Lines 5 and 6: their purpose is
+ set of states. The NFA is constructed in Lines 30--34.
+ Note the interesting commands in Lines 7 and 8: their purpose is
to ensure that \texttt{fixpT} is the tail-recursive version of
the fixpoint construction; otherwise we would quickly get a
stack-overflow exception, even on small examples, due to limitations
@@ -546,12 +547,14 @@
\begin{figure}[p]
\small
-\lstinputlisting[numbers=left]{../progs/display/thompson1.scala}
-\caption{The first part of the Thompson Construction. Lines 7--16
+\lstinputlisting[numbers=left,linerange={1-20}]{../progs/automata/thompson.sc}
+\hspace{5mm}\texttt{\dots}
+\lstinputlisting[numbers=left,linerange={28-45},firstnumber=28]{../progs/automata/thompson.sc}
+\caption{The first part of the Thompson Construction. Lines 10--19
implement a way of how to create new states that are all
distinct by virtue of a counter. This counter is
increased in the companion object of \texttt{TState}
- whenever a new state is created. The code in Lines 24--40
+ whenever a new state is created. The code in Lines 38--45
constructs NFAs for the simple regular expressions $\ZERO$, $\ONE$ and $c$.
Compare this code with the pictures given in \eqref{simplecases} on
Page~\pageref{simplecases}.
@@ -560,12 +563,12 @@
\begin{figure}[p]
\small
-\lstinputlisting[numbers=left]{../progs/display/thompson2.scala}
+\lstinputlisting[numbers=left,firstline=48,firstnumber=48,lastline=85]{../progs/automata/thompson.sc}
\caption{The second part of the Thompson Construction implementing
the composition of NFAs according to $\cdot$, $+$ and ${}^*$.
- The implicit class about rich partial functions
+ The implicit class (Lines 48--54) about rich partial functions
implements the infix operation \texttt{+++} which
- combines an $\epsilon$NFA transition with a NFA transition
+ combines an $\epsilon$NFA transition with an NFA transition
(both are given as partial functions---but with different type!).\label{thompson2}}
\end{figure}
@@ -651,25 +654,25 @@
to the second NFA; the ending of the string is recognised by the
second NFA...just like matching of a string by the regular expression
$r_1\cdot r_2$. The Scala code for this construction is given in
-Figure~\ref{thompson2} in Lines 16--23. The starting states of the
+Figure~\ref{thompson2} in Lines 57--65. The starting states of the
$\epsilon$NFA are the starting states of the first NFA (corresponding
to $r_1$); the accepting function is the accepting function of the
second NFA (corresponding to $r_2$). The new transition function is
all the ``old'' transitions plus the $\epsilon$-transitions connecting
the accepting states of the first NFA to the starting states of the
-first NFA (Lines 18 and 19). The $\epsilon$NFA is then immediately
+second NFA (Lines 59 and 60). The $\epsilon$NFA is then immediately
translated in a NFA.
The case for the alternative regular expression $r_1 + r_2$ is
slightly different: We are given by recursion two NFAs representing
$r_1$ and $r_2$ respectively. Each NFA has some starting states and
-some accepting states. We obtain a NFA for the regular expression $r_1
-+ r_2$ by composing the transition functions (this crucially depends
-on knowing that the states of each component NFA are distinct---recall
-we implemented for this to hold some bespoke code for states). We also
-need to combine the starting states and accepting functions
-appropriately.
+some accepting states. We obtain a NFA for the regular expression
+$r_1 + r_2$ by composing the transition functions (this crucially
+depends on knowing that the states of each component NFA are
+distinct---recall we implemented for this to hold by some bespoke code
+for \texttt{TState}s). We also need to combine the starting states and
+accepting functions appropriately.
\begin{center}
\begin{tabular}[t]{ccc}
@@ -734,7 +737,7 @@
\end{center}
\noindent The code for this construction is in Figure~\ref{thompson2}
-in Lines 25--33.
+in Lines 67--75.
Finally for the $*$-case we have a NFA for $r$ and connect its
accepting states to a new starting state via
@@ -788,14 +791,15 @@
\end{center}
\noindent
-The corresponding code is in Figure~\ref{thompson2} in Lines 35--43)
+The corresponding code is in Figure~\ref{thompson2} in Lines 77--85)
-To sum up, you can see in the sequence and star cases the need of
-having silent $\epsilon$-transitions. Similarly the alternative case
-shows the need of the NFA-nondeterminism. It seems awkward to form the
-`alternative' composition of two DFAs, because DFA do not allow
-several starting and successor states. All these constructions can now
-be put together in the following recursive function:
+To sum up, you can see in the sequence and star cases the need for
+silent $\epsilon$-transitions. Otherwise this construction just
+becomes awkward. Similarly the alternative case shows the need of the
+NFA-nondeterminism. It looks non-obvious to form the `alternative'
+composition of two DFAs, because DFA do not allow several starting and
+successor states. All these constructions can now be put together in
+the following recursive function:
{\small\begin{lstlisting}[language=Scala]
@@ -813,13 +817,15 @@
It calculates a NFA from a regular expressions. At last we can run
NFAs for the our evil regular expression examples. The graph on the
left shows that when translating a regular expression such as
-$a^{\{n\}}$ into a NFA, the size can blow up and then even the
-relative fast (on small examples) breadth-first search can be
-slow. The graph on the right shows that with `evil' regular
-expressions the depth-first search can be abysmally slow. Even if the
-graphs not completely overlap with the curves of Python, Ruby and
-Java, they are similar enough. OK\ldots now you know why regular
-expression matchers in those languages are so slow.
+$a^{?\{n\}} \cdot a^{\{n\}}$ into a NFA, the size can blow up and then
+even the relative fast (on small examples) breadth-first search can be
+slow\ldots the red line maxes out at about 15 $n$s.
+
+
+The graph on the right shows that with `evil' regular expressions also
+the depth-first search can be abysmally slow. Even if the graphs not
+completely overlap with the curves of Python, Ruby and Java, they are
+similar enough.
\begin{center}
@@ -882,7 +888,7 @@
axis lines=left,
width=5.5cm,
height=4cm,
- legend entries={Python, Java, depth-first NFA},
+ legend entries={Python, Java 8, depth-first NFA},
legend style={at={(0.5,-0.25)},anchor=north,font=\small},
legend cell align=left]
\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
@@ -902,7 +908,9 @@
\end{tabular}
\end{center}
-
+\noindent
+OK\ldots now you know why regular expression matchers in those
+languages are sometimes so slow. A bit surprising, don't you agree?
\subsection*{Subset Construction}
@@ -931,7 +939,8 @@
that also recognises the same language. This might sound a bit
paradoxical: NFA $\rightarrow$ decision of acceptance hard; DFA
$\rightarrow$ decision easy. But this \emph{is} true\ldots but of course
-there is always a caveat---nothing ever is for free in life.
+there is always a caveat---nothing ever is for free in life. Let's see what this
+actually means.
There are actually a number of methods for transforming a NFA into
an equivalent DFA, but the most famous one is the \emph{subset
@@ -1196,7 +1205,7 @@
In Step 3 we need to fill in more stars according whether
one of the next-state pairs are marked. We have to do this
-for every unmarked field until there is no change anymore.
+for every unmarked field until there is no change any more.
This gives the triangle
\begin{center}
@@ -1255,6 +1264,22 @@
\end{center}
+\noindent This minimised DFA is certainly fast when it comes deciding
+whether a string is accepted or not. But this is not universally the
+case. Suppose you count the nodes in a regular expression (when
+represented as tree). If you look carefully at the Thompson
+Construction you can see that the constructed NFA has states that grow
+linearly in terms of the size of the regular expression. This is good,
+but as we have seen earlier deciding whether a string is matched by an
+NFA is hard. Translating an NFA into a DFA means deciding whether a
+string is matched by a DFA is easy, but the number of states can grow
+exponentially, even after minimisation. Say a NFA has $n$ states, then
+in the worst case the corresponding minimal DFA that can match the
+same language as the NFA might contain $2^n$ of states. Unfortunately
+in many interesting cases this worst case bound is the dominant
+factor.
+
+
By the way, we are not bothering with implementing the above
minimisation algorithm: while up to now all the transformations used
some clever composition of functions, the minimisation algorithm
@@ -1262,7 +1287,8 @@
would require a more concrete representation of the transition
function (like maps). If we did this, however, then many advantages of
the functions would be thrown away. So the compromise is to not being
-able to minimise (easily) our DFAs.
+able to minimise (easily) our DFAs. We want to use regular expressions
+directly anyway.
\subsection*{Brzozowski's Method}