updated
authorChristian Urban <christian.urban@kcl.ac.uk>
Wed, 02 Sep 2020 23:34:19 +0100
changeset 753 d94fdbef1a4f
parent 752 c0bdd4ad69ca
child 754 1c9a23304b85
updated
handouts/ho03.pdf
handouts/ho03.tex
progs/automata/dfa.sc
progs/automata/enfa.sc
progs/automata/nfa.sc
progs/automata/thompson.sc
progs/catastrophic/catastrophic.js
progs/catastrophic/catastrophic.py
progs/catastrophic/catastrophic.rb
progs/catastrophic/catastrophic2.py
Binary file handouts/ho03.pdf has changed
--- a/handouts/ho03.tex	Tue Sep 01 16:00:37 2020 +0100
+++ b/handouts/ho03.tex	Wed Sep 02 23:34:19 2020 +0100
@@ -6,7 +6,7 @@
 
 
 \begin{document}
-\fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017}
+\fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017, 2020}
 
 \section*{Handout 3 (Finite Automata)}
 
@@ -131,7 +131,7 @@
 
 \begin{figure}[p]
 \small
-\lstinputlisting[numbers=left]{../progs/display/dfa.scala}
+\lstinputlisting[numbers=left,lastline=43]{../progs/automata/dfa.sc}
 \caption{An implementation of DFAs in Scala using partial functions.
   Note some subtleties: \texttt{deltas} implements the delta-hat
   construction by lifting the (partial) transition  function to lists
@@ -139,7 +139,7 @@
   it can obviously go ``wrong'' in which case the \texttt{Try} in
   \texttt{accepts} catches the error and returns \texttt{false}---that
   means the string is not accepted.  The example \texttt{delta} in
-  Line 28--38 implements the DFA example shown earlier in the
+  Line 22--43 implements the DFA example shown earlier in the
   handout.\label{dfa}}
 \end{figure}
 
@@ -161,7 +161,7 @@
 definition, but a function from states to booleans (this function is
 supposed to return true whenever a state is final; false
 otherwise). While this boolean function is different from the sets of
-states, Scala allows us to use sets for such functions (see Line 40 where
+states, Scala allows us to use sets for such functions (see Line 41 where
 the DFA is initialised). Again it will become clear later on why I use
 functions for final states, rather than sets.
 
@@ -169,8 +169,8 @@
 partial functions for representing the transitions; alternatives would
 have been \texttt{Maps} or even \texttt{Lists}. One of the main
 advantages of using partial functions is that transitions can be quite
-nicely defined by a series of \texttt{case} statements (see Lines 28
--- 38 for an example). If you need to represent an automaton with a
+nicely defined by a series of \texttt{case} statements (see Lines 29
+-- 39 for an example). If you need to represent an automaton with a
 sink state (catch-all-state), you can use Scala's pattern matching and
 write something like
 
@@ -285,12 +285,12 @@
 reachable from a single state \texttt{q} by a character~\texttt{c}. In
 case there is no such state---the partial transition function is
 undefined---the empty set is returned (see function
-\texttt{applyOrElse} in Lines 9 and 10). The function \texttt{nexts}
+\texttt{applyOrElse} in Lines 11 and 12). The function \texttt{nexts}
 just lifts this function to sets of states.
  
 \begin{figure}[p]
 \small
-\lstinputlisting[numbers=left]{../progs/display/nfa.scala}
+\lstinputlisting[numbers=left,lastline=43]{../progs/automata/nfa.sc}
 \caption{A Scala implementation of NFAs using partial functions.
   Notice that the function \texttt{accepts} implements the
   acceptance of a string in a breadth-first search fashion. This can be a costly
@@ -301,7 +301,7 @@
 Look very careful at the \texttt{accepts} and \texttt{deltas}
 functions in NFAs and remember that when accepting a string by a NFA
 we might have to explore all possible transitions (recall which state
-to go to is not unique anymore with NFAs\ldots{}we need to explore
+to go to is not unique any more with NFAs\ldots{}we need to explore
 potentially all next states). The implementation achieves this
 exploration through a \emph{breadth-first search}. This is fine for
 small NFAs, but can lead to real memory problems when the NFAs are
@@ -336,9 +336,9 @@
 
 In order to get an idea what calculations are performed by Java \&
 friends, we need a method for transforming a regular expression into
-an automaton. This automaton should accept exactly those strings that
+a corresponding automaton. This automaton should accept exactly those strings that
 are accepted by the regular expression.  The simplest and most
-well-known method for this is called \emph{Thompson Construction},
+well-known method for this is called the \emph{Thompson Construction},
 after the Turing Award winner Ken Thompson. This method is by
 recursion over regular expressions and depends on the non-determinism
 in NFAs described in the previous section. You will see shortly why
@@ -409,12 +409,12 @@
 are---that would require extending the transition relation of
 NFAs. Next, we would show that the $\epsilon$NFAs are equivalent to
 NFAs and so on. Once we have done all this on paper, we would need to
-implement $\epsilon$NFAs. Lets try to take a shortcut instead. We are
+implement $\epsilon$NFAs. Let's try to take a shortcut instead. We are
 not really interested in $\epsilon$NFAs; they are only a convenient
 tool for translating regular expressions into automata. So we are not
 going to implementing them explicitly, but translate them immediately
 into NFAs (in a sense $\epsilon$NFAs are just a convenient API for
-lazy people ;o).  How does the translation work? Well we have to find
+lazy people ;o).  How does this translation work? Well we have to find
 all transitions of the form
 
 \[
@@ -423,9 +423,10 @@
 \stackrel{\epsilon}{\longrightarrow}\ldots\stackrel{\epsilon}{\longrightarrow} q'
 \]
 
-\noindent where somewhere in the ``middle'' is an $a$-transition. We
-replace them with $q \stackrel{a}{\longrightarrow} q'$. Doing this to
-the $\epsilon$NFA on the right-hand side above gives the NFA
+\noindent where somewhere in the ``middle'' is an $a$-transition (for
+a character, say, $a$). We replace them with
+$q \stackrel{a}{\longrightarrow} q'$. Doing this to the $\epsilon$NFA
+on the right-hand side above gives the NFA
 
 \begin{center}
 \begin{tikzpicture}[>=stealth',very thick,
@@ -449,11 +450,11 @@
 So in what follows, whenever we are given an $\epsilon$NFA we will
 replace it by an equivalent NFA. The Scala code for this translation
 is given in Figure~\ref{enfa}. The main workhorse in this code is a
-function that calculates a fixpoint of function (Lines 5--10). This
+function that calculates a fixpoint of function (Lines 6--12). This
 function is used for ``discovering'' which states are reachable by
 $\epsilon$-transitions. Once no new state is discovered, a fixpoint is
 reached. This is used for example when calculating the starting states
-of an equivalent NFA (see Line 36): we start with all starting states
+of an equivalent NFA (see Line 28): we start with all starting states
 of the $\epsilon$NFA and then look for all additional states that can
 be reached by $\epsilon$-transitions. We keep on doing this until no
 new state can be reached. This is what the $\epsilon$-closure, named
@@ -464,16 +465,16 @@
 
 \begin{figure}[p]
 \small
-\lstinputlisting[numbers=left]{../progs/display/enfa.scala}
+\lstinputlisting[numbers=left,lastline=43]{../progs/automata/enfa.sc}
 
 \caption{A Scala function that translates $\epsilon$NFA into NFAs. The
   transition function of $\epsilon$NFA takes as input an \texttt{Option[C]}.
   \texttt{None} stands for an $\epsilon$-transition; \texttt{Some(c)}
   for a ``proper'' transition consuming a character. The functions in
-  Lines 18--26 calculate
+  Lines 19--24 calculate
   all states reachable by one or more $\epsilon$-transition for a given
-  set of states. The NFA is constructed in Lines 36--38.
-  Note the interesting commands in Lines 5 and 6: their purpose is
+  set of states. The NFA is constructed in Lines 30--34.
+  Note the interesting commands in Lines 7 and 8: their purpose is
   to ensure that \texttt{fixpT} is the tail-recursive version of
   the fixpoint construction; otherwise we would quickly get a
   stack-overflow exception, even on small examples, due to limitations
@@ -546,12 +547,14 @@
 
 \begin{figure}[p]
 \small
-\lstinputlisting[numbers=left]{../progs/display/thompson1.scala}
-\caption{The first part of the Thompson Construction. Lines 7--16
+\lstinputlisting[numbers=left,linerange={1-20}]{../progs/automata/thompson.sc}
+\hspace{5mm}\texttt{\dots}
+\lstinputlisting[numbers=left,linerange={28-45},firstnumber=28]{../progs/automata/thompson.sc}
+\caption{The first part of the Thompson Construction. Lines 10--19
   implement a way of how to create new states that are all
   distinct by virtue of a counter. This counter is
   increased in the companion object of \texttt{TState}
-  whenever a new state is created. The code in Lines 24--40
+  whenever a new state is created. The code in Lines 38--45
   constructs NFAs for the simple regular expressions $\ZERO$, $\ONE$ and $c$.
   Compare this code with the pictures given in \eqref{simplecases} on
   Page~\pageref{simplecases}.
@@ -560,12 +563,12 @@
 
 \begin{figure}[p]
 \small
-\lstinputlisting[numbers=left]{../progs/display/thompson2.scala}
+\lstinputlisting[numbers=left,firstline=48,firstnumber=48,lastline=85]{../progs/automata/thompson.sc}
 \caption{The second part of the Thompson Construction implementing
   the composition of NFAs according to $\cdot$, $+$ and ${}^*$.
-  The implicit class about rich partial functions
+  The implicit class (Lines 48--54) about rich partial functions
   implements the infix operation \texttt{+++} which
-  combines an $\epsilon$NFA transition with a NFA transition
+  combines an $\epsilon$NFA transition with an NFA transition
   (both are given as partial functions---but with different type!).\label{thompson2}}
 \end{figure}
 
@@ -651,25 +654,25 @@
 to the second NFA; the ending of the string is recognised by the
 second NFA...just like matching of a string by the regular expression
 $r_1\cdot r_2$. The Scala code for this construction is given in
-Figure~\ref{thompson2} in Lines 16--23. The starting states of the
+Figure~\ref{thompson2} in Lines 57--65. The starting states of the
 $\epsilon$NFA are the starting states of the first NFA (corresponding
 to $r_1$); the accepting function is the accepting function of the
 second NFA (corresponding to $r_2$). The new transition function is
 all the ``old'' transitions plus the $\epsilon$-transitions connecting
 the accepting states of the first NFA to the starting states of the
-first NFA (Lines 18 and 19). The $\epsilon$NFA is then immediately
+second NFA (Lines 59 and 60). The $\epsilon$NFA is then immediately
 translated in a NFA.
 
 
 The case for the alternative regular expression $r_1 + r_2$ is
 slightly different: We are given by recursion two NFAs representing
 $r_1$ and $r_2$ respectively. Each NFA has some starting states and
-some accepting states. We obtain a NFA for the regular expression $r_1
-+ r_2$ by composing the transition functions (this crucially depends
-on knowing that the states of each component NFA are distinct---recall
-we implemented for this to hold some bespoke code for states). We also
-need to combine the starting states and accepting functions
-appropriately.
+some accepting states. We obtain a NFA for the regular expression
+$r_1 + r_2$ by composing the transition functions (this crucially
+depends on knowing that the states of each component NFA are
+distinct---recall we implemented for this to hold by some bespoke code
+for \texttt{TState}s). We also need to combine the starting states and
+accepting functions appropriately.
 
 \begin{center}
 \begin{tabular}[t]{ccc}
@@ -734,7 +737,7 @@
 \end{center}
 
 \noindent The code for this construction is in Figure~\ref{thompson2}
-in Lines 25--33.
+in Lines 67--75.
 
 Finally for the $*$-case we have a NFA for $r$ and connect its
 accepting states to a new starting state via
@@ -788,14 +791,15 @@
 \end{center}
 
 \noindent 
-The corresponding code is in Figure~\ref{thompson2} in Lines 35--43)
+The corresponding code is in Figure~\ref{thompson2} in Lines 77--85)
 
-To sum up, you can see in the sequence and star cases the need of
-having silent $\epsilon$-transitions. Similarly the alternative case
-shows the need of the NFA-nondeterminism. It seems awkward to form the
-`alternative' composition of two DFAs, because DFA do not allow
-several starting and successor states. All these constructions can now
-be put together in the following recursive function:
+To sum up, you can see in the sequence and star cases the need for
+silent $\epsilon$-transitions. Otherwise this construction just
+becomes awkward. Similarly the alternative case shows the need of the
+NFA-nondeterminism. It looks non-obvious to form the `alternative'
+composition of two DFAs, because DFA do not allow several starting and
+successor states. All these constructions can now be put together in
+the following recursive function:
 
 
 {\small\begin{lstlisting}[language=Scala]
@@ -813,13 +817,15 @@
 It calculates a NFA from a regular expressions. At last we can run
 NFAs for the our evil regular expression examples.  The graph on the
 left shows that when translating a regular expression such as
-$a^{\{n\}}$ into a NFA, the size can blow up and then even the
-relative fast (on small examples) breadth-first search can be
-slow. The graph on the right shows that with `evil' regular
-expressions the depth-first search can be abysmally slow. Even if the
-graphs not completely overlap with the curves of Python, Ruby and
-Java, they are similar enough. OK\ldots now you know why regular
-expression matchers in those languages are so slow. 
+$a^{?\{n\}} \cdot a^{\{n\}}$ into a NFA, the size can blow up and then
+even the relative fast (on small examples) breadth-first search can be
+slow\ldots the red line maxes out at about 15 $n$s.
+
+
+The graph on the right shows that with `evil' regular expressions also
+the depth-first search can be abysmally slow. Even if the graphs not
+completely overlap with the curves of Python, Ruby and Java, they are
+similar enough. 
 
 
 \begin{center}
@@ -882,7 +888,7 @@
     axis lines=left,
     width=5.5cm,
     height=4cm, 
-    legend entries={Python, Java, depth-first NFA},
+    legend entries={Python, Java 8, depth-first NFA},
     legend style={at={(0.5,-0.25)},anchor=north,font=\small},
     legend cell align=left]
   \addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
@@ -902,7 +908,9 @@
 \end{tabular}
 \end{center}
 
-
+\noindent
+OK\ldots now you know why regular expression matchers in those
+languages are sometimes so slow. A bit surprising, don't you agree?
 
 \subsection*{Subset Construction}
 
@@ -931,7 +939,8 @@
 that also recognises the same language. This might sound a bit
 paradoxical: NFA $\rightarrow$ decision of acceptance hard; DFA
 $\rightarrow$ decision easy. But this \emph{is} true\ldots but of course
-there is always a caveat---nothing ever is for free in life.
+there is always a caveat---nothing ever is for free in life. Let's see what this
+actually means.
 
 There are actually a number of methods for transforming a NFA into
 an equivalent DFA, but the most famous one is the \emph{subset
@@ -1196,7 +1205,7 @@
 
 In Step 3 we need to fill in more stars according whether 
 one of the next-state pairs are marked. We have to do this 
-for every unmarked field until there is no change anymore.
+for every unmarked field until there is no change any more.
 This gives the triangle
 
 \begin{center}
@@ -1255,6 +1264,22 @@
 \end{center}
 
 
+\noindent This minimised DFA is certainly fast when it comes deciding
+whether a string is accepted or not. But this is not universally the
+case.  Suppose you count the nodes in a regular expression (when
+represented as tree).  If you look carefully at the Thompson
+Construction you can see that the constructed NFA has states that grow
+linearly in terms of the size of the regular expression. This is good,
+but as we have seen earlier deciding whether a string is matched by an
+NFA is hard. Translating an NFA into a DFA means deciding whether a
+string is matched by a DFA is easy, but the number of states can grow
+exponentially, even after minimisation. Say a NFA has $n$ states, then
+in the worst case the corresponding minimal DFA that can match the
+same language as the NFA might contain $2^n$ of states. Unfortunately
+in many interesting cases this worst case bound is the dominant
+factor.
+
+
 By the way, we are not bothering with implementing the above
 minimisation algorithm: while up to now all the transformations used
 some clever composition of functions, the minimisation algorithm
@@ -1262,7 +1287,8 @@
 would require a more concrete representation of the transition
 function (like maps). If we did this, however, then many advantages of
 the functions would be thrown away. So the compromise is to not being
-able to minimise (easily) our DFAs.
+able to minimise (easily) our DFAs. We want to use regular expressions
+directly anyway.
 
 \subsection*{Brzozowski's Method}
 
--- a/progs/automata/dfa.sc	Tue Sep 01 16:00:37 2020 +0100
+++ b/progs/automata/dfa.sc	Wed Sep 02 23:34:19 2020 +0100
@@ -39,7 +39,7 @@
     case (Q4, 'b') => Q4 }
 
 val dfa = DFA(Q0, delta, Set[State](Q4))
-dfa.accepts("aaabbb".toList) 
+dfa.accepts("aaabbb".toList)    // true
 
 dfa.accepts("bbabaab".toList)   // true
 dfa.accepts("baba".toList)      // false
--- a/progs/automata/enfa.sc	Tue Sep 01 16:00:37 2020 +0100
+++ b/progs/automata/enfa.sc	Wed Sep 02 23:34:19 2020 +0100
@@ -3,8 +3,6 @@
 import $file.dfa, dfa._ 
 import $file.nfa, nfa._ 
 
-
-
 // a fixpoint construction
 import scala.annotation.tailrec
 @tailrec
--- a/progs/automata/nfa.sc	Tue Sep 01 16:00:37 2020 +0100
+++ b/progs/automata/nfa.sc	Wed Sep 02 23:34:19 2020 +0100
@@ -6,7 +6,6 @@
 import $file.dfa, dfa._ 
 
 
-
 // return an empty set, when a parial function is not defined
 import scala.util.Try
 def applyOrElse[A, B](f: A :=> Set[B], x: A) : Set[B] =
--- a/progs/automata/thompson.sc	Tue Sep 01 16:00:37 2020 +0100
+++ b/progs/automata/thompson.sc	Wed Sep 02 23:34:19 2020 +0100
@@ -24,15 +24,6 @@
 type NFAtrans = (TState, Char) :=> Set[TState]
 type eNFAtrans = (TState, Option[Char]) :=> Set[TState]
 
-
-// for composing an eNFA transition with an NFA transition
-// | is for set union
-implicit def nfaOps(f: eNFAtrans) = new {
-  def +++(g: NFAtrans) : eNFAtrans = 
-  { case (q, None) =>  applyOrElse(f, (q, None)) 
-    case (q, Some(c)) => applyOrElse(f, (q, Some(c))) | applyOrElse(g, (q, c))  }
-}
-
  
 // NFA that does not accept any string
 def NFA_ZERO(): NFAt = {
@@ -53,6 +44,16 @@
   NFA(Set(Q1), { case (Q1, d) if (c == d) => Set(Q2) }, Set(Q2))
 }
 
+
+// for composing an eNFA transition with an NFA transition
+// | is for set union
+implicit def nfaOps(f: eNFAtrans) = new {
+  def +++(g: NFAtrans) : eNFAtrans = 
+  { case (q, None) =>  applyOrElse(f, (q, None)) 
+    case (q, Some(c)) => applyOrElse(f, (q, Some(c))) | applyOrElse(g, (q, c))  }
+}
+
+
 // sequence of two NFAs
 def NFA_SEQ(enfa1: NFAt, enfa2: NFAt) : NFAt = {
   val new_delta : eNFAtrans = 
--- a/progs/catastrophic/catastrophic.js	Tue Sep 01 16:00:37 2020 +0100
+++ b/progs/catastrophic/catastrophic.js	Wed Sep 02 23:34:19 2020 +0100
@@ -7,11 +7,11 @@
 //
 // call with:
 //
-//  $> catastrophic.js 20
+//  $> ./catastrophic.js 20
 //
 // call with timing as:
 //
-//  $> time catastrophic.js 25
+//  $> time ./catastrophic.js 25
 
 
 const args = process.argv[2]
--- a/progs/catastrophic/catastrophic.py	Tue Sep 01 16:00:37 2020 +0100
+++ b/progs/catastrophic/catastrophic.py	Wed Sep 02 23:34:19 2020 +0100
@@ -1,4 +1,4 @@
-#!/usr/bin/env python
+#!/usr/bin/env python3
 
 import re
 import sys
@@ -19,5 +19,5 @@
 s = ("a" * int(cn))
 m = re.match('(a*)*b' , s) 
 
-print s
-print m
+print(s)
+print(m)
--- a/progs/catastrophic/catastrophic.rb	Tue Sep 01 16:00:37 2020 +0100
+++ b/progs/catastrophic/catastrophic.rb	Wed Sep 02 23:34:19 2020 +0100
@@ -1,3 +1,5 @@
+#!/usr/bin/env ruby
+
 # A case of catastrophic backtracking in Ruby
 #---------------------------------------------
 # example provided by Daniel Baldwin
@@ -8,12 +10,12 @@
 #
 # run on the command line with:
 #
-# ruby catastrophic.rb
+# ./catastrophic.rb
 #
 
-nums = (1..1000)
+nums = (1..30)
 
-#iterate through the nums 1-1000
+#iterate through the nums 1-30
 nums.each do |i|
 
 	start_time = Time.now
--- a/progs/catastrophic/catastrophic2.py	Tue Sep 01 16:00:37 2020 +0100
+++ b/progs/catastrophic/catastrophic2.py	Wed Sep 02 23:34:19 2020 +0100
@@ -1,4 +1,4 @@
-#!/usr/bin/env python
+#!/usr/bin/env python3
 import re
 import sys
 
@@ -9,7 +9,7 @@
 #
 # call with timing as:
 #
-#    time ./catastrophic.py 20
+#    time ./catastrophic2.py 20
 
 # counter n given on the command line
 cn = sys.argv[1]
@@ -18,4 +18,5 @@
 s = ("a" * int(cn))
 m = re.match('(a*)*b' , s) 
 
-print s
+print(s)
+print(m)