handouts/ho03.tex
changeset 487 a697421eaa04
parent 485 bd6d999ab7b6
child 488 598741d39d21
--- a/handouts/ho03.tex	Tue Apr 25 12:33:16 2017 +0100
+++ b/handouts/ho03.tex	Fri Apr 28 11:01:25 2017 +0100
@@ -3,11 +3,6 @@
 \usepackage{../langs}
 \usepackage{../graphics}
 
-%We can even allow ``silent
-%transitions'', also called epsilon-transitions. They allow us
-%to go from one state to the next without having a character
-%consumed. We label such silent transition with the letter
-%$\epsilon$.
 
 \begin{document}
 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017}
@@ -130,10 +125,10 @@
 \small
 \lstinputlisting[numbers=left,linebackgroundcolor=
                   {\ifodd\value{lstnumber}\color{capri!3}\fi}]
-                  {../progs/dfa.scala}
+                  {../progs/display/dfa.scala}
 \caption{A Scala implementation of DFAs using partial functions.
   Notice some subtleties: \texttt{deltas} implements the delta-hat
-  construction by lifting the transition (partial) function to lists
+  construction by lifting the (partial) transition  function to lists
   of characters. Since \texttt{delta} is given as a partial function,
   it can obviously go ``wrong'' in which case the \texttt{Try} in
   \texttt{accepts} catches the error and returns \texttt{false}---that
@@ -266,10 +261,20 @@
 partial transition functions that return a single state; my NFAs
 return a set. I let you think about this representation for
 NFA-transitions and how it corresponds to the relations used in the
-mathematical definition of NFAs.
+mathematical definition of NFAs. An example of a transition function
+in Scala for the NFA above is
 
-Like in the mathematical definition, \texttt{starts} is in NFAs a set
-of states; \texttt{fins} is again a function from states to
+{\small\begin{lstlisting}[language=Scala,linebackgroundcolor=
+                  {\ifodd\value{lstnumber}\color{capri!3}\fi}]
+val nfa_delta : (State, Char) :=> Set[State] = 
+  { case (Q0, 'a') => Set(Q1, Q2)
+    case (Q0, 'b') => Set(Q0)
+    case (Q1, 'a') => Set(Q1, Q2)
+    case (Q1, 'b') => Set(Q0, Q1) }
+\end{lstlisting}}  
+
+\noindent Like in the mathematical definition, \texttt{starts} is in
+NFAs a set of states; \texttt{fins} is again a function from states to
 booleans. The \texttt{next} function calculates the set of next states
 reachable from a single state \texttt{q} by a character~\texttt{c}. In
 case there is no such state---the partial transition function is
@@ -281,7 +286,7 @@
 \small
 \lstinputlisting[numbers=left,linebackgroundcolor=
                   {\ifodd\value{lstnumber}\color{capri!3}\fi}]
-                  {../progs/nfa.scala}
+                  {../progs/display/nfa.scala}
 \caption{A Scala implementation of NFAs using partial functions.
   Notice that the function \texttt{accepts} implements the
   acceptance of a string in a breath-first search fashion. This can be a costly
@@ -294,7 +299,7 @@
 we might have to explore all possible transitions (recall which state
 to go to is not unique anymore with NFAs\ldots{}we need to explore
 potentially all next states). The implementation achieves this
-exploration in a \emph{breadth-first search} manner. This is fine for
+exploration through a \emph{breadth-first search}. This is fine for
 small NFAs, but can lead to real memory problems when the NFAs are
 bigger and larger strings need to be processed. As result, some
 regular expression matching engines resort to a \emph{depth-first
@@ -315,16 +320,16 @@
 \end{lstlisting}}
 
 \noindent
-This depth-first way of exploration seems to work efficiently in many
-examples and is much less of a strain on memory. The problem is that
-the backtracking can get ``catastrophic'' in some examples---remember
-the catastrophic backtracking from earlier lectures. This depth-first
-search with backtracking is the reason for the abysmal performance of
-some regular expression matchings in Java, Ruby and Python. I like to
-show you this next.
+This depth-first way of exploration seems to work quite efficiently in
+many examples and is much less of a strain on memory. The problem is
+that the backtracking can get ``catastrophic'' in some
+examples---remember the catastrophic backtracking from earlier
+lectures. This depth-first search with backtracking is the reason for
+the abysmal performance of some regular expression matchings in Java,
+Ruby and Python. I like to show you this in the next two sections.
 
 
-\subsubsection*{Thompson Construction}
+\subsubsection*{Epsilon NFAs}
 
 In order to get an idea what calculations are performed by Java \&
 friends, we need a method for transforming a regular expression into
@@ -332,16 +337,18 @@
 are accepted by the regular expression.  The simplest and most
 well-known method for this is called \emph{Thompson Construction},
 after the Turing Award winner Ken Thompson. This method is by
-recursion over regular expressions and uses the non-determinism in
-NFAs. You will see shortly why this construction works well with NFAs,
-but is not so straightforward with DFAs. Unfortunately we are still
-one step away from our intended target though---because this
-construction uses a version of NFAs that allows ``silent
-transitions''. The idea behind silent transitions is that they
-allow us to go from one state to the next without having to consume a
-character. We label such silent transition with the letter
-$\epsilon$ and call the automata $\epsilon$NFAs. Two
-typical examples of $\epsilon$NFAs are
+recursion over regular expressions and depends on the non-determinism
+in NFAs described in the earlier section. You will see shortly why
+this construction works well with NFAs, but is not so straightforward
+with DFAs.
+
+Unfortunately we are still one step away from our intended target
+though---because this construction uses a version of NFAs that allows
+``silent transitions''. The idea behind silent transitions is that
+they allow us to go from one state to the next without having to
+consume a character. We label such silent transition with the letter
+$\epsilon$ and call the automata $\epsilon$NFAs. Two typical examples
+of $\epsilon$NFAs are:
 
 
 \begin{center}
@@ -373,24 +380,24 @@
 \end{center}
 
 \noindent
-Consider the $\epsilon$NFA on the left-hand side: an
-$\epsilon$-transition means you do not have to ``consume'' any part of
-the input string, but ``silently'' change to a different state. For
-example, if you are in the starting state $Q_0$, you can silently move
-either to $Q_1$ or $Q_2$. In this example you can see that once you
-are in $Q_1$, respectively $Q_2$, you cannot ``go back'' to the other
-states.
-
-On first appearances, $\epsilon$-transitions might look rather
-strange, or even silly. To start with, silent transitions make the
-decision whether a string is accepted by an automaton even harder:
-with $\epsilon$NFAs we have to look whether we can do first some
-$\epsilon$-transitions and then do a ``proper''-transition; and after
-any ``proper''-transition we again have to check whether we can do
-again some silent transitions. Even worse, if there is a silent
-transition pointing back to the same state, then we have to be careful
-our decision procedure for strings does not loop (remember the
-depth-first search for exploring all states).
+Consider the $\epsilon$NFA on the left-hand side: the
+$\epsilon$-transitions mean you do not have to ``consume'' any part of
+the input string, but ``silently'' change to a different state. In
+this example, if you are in the starting state $Q_0$, you can silently
+move either to $Q_1$ or $Q_2$. You can see that once you are in $Q_1$,
+respectively $Q_2$, you cannot ``go back'' to the other states. So it
+seems allowing $\epsilon$-transitions is a rather substancial
+extension to NFAs. On first appearances, $\epsilon$-transitions might
+even look rather strange, or even silly. To start with, silent
+transitions make the decision whether a string is accepted by an
+automaton even harder: with $\epsilon$NFAs we have to look whether we
+can do first some $\epsilon$-transitions and then do a
+``proper''-transition; and after any ``proper''-transition we again
+have to check whether we can do again some silent transitions. Even
+worse, if there is a silent transition pointing back to the same
+state, then we have to be careful our decision procedure for strings
+does not loop (remember the depth-first search for exploring all
+states).
 
 The obvious question is: Do we get anything in return for this hassle
 with silent transitions?  Well, we still have to work for it\ldots
@@ -399,12 +406,12 @@
 are---that would require extending the transition relation of
 NFAs. Next, show that the $\epsilon$NFAs are equivalent to NFAs and so
 on. Once we have done all this on paper, we would need to implement
-$\epsilon$NFAs. Lets try to take a shortcut---we are not really
+$\epsilon$NFAs. Lets try to take a shortcut instead. We are not really
 interested in $\epsilon$NFAs; they are only a convenient tool for
-translating regular expressions into automata. We are not going to
-implementing them explicitly, but translate them directly into NFAs
-(in a sense $\epsilon$NFAs are just a convenient API for lazy people).
-How does the translation work: well we have to find all transitions of
+translating regular expressions into automata. So we are not going to
+implementing them explicitly, but translate them immediately into NFAs
+(in a sense $\epsilon$NFAs are just a convenient API for lazy people ;o).
+How does the translation work? Well we have to find all transitions of
 the form
 
 \[
@@ -414,7 +421,7 @@
 \]
 
 \noindent and replace them with $q \stackrel{a}{\longrightarrow}
-q'$. Doing this to the $\epsilon$NFA on the left-hand side above gives
+q'$. Doing this to the $\epsilon$NFA on the right-hand side above gives
 the NFA
 
 \begin{center}
@@ -432,24 +439,31 @@
 \end{tikzpicture}
 \end{center}
 
-\noindent where the single the $\epsilon$-transition is replaced by
-three more $a$-transitions. So whenever we are given $\epsilon$NFA we
-will, similarly, replace it by an equivalent NFA. The code for this is
-given in Figure~\ref{enfa}. At the core is a function that calculates
-a fixpoint of function (Lines 5--10). This is used for ``discovering''
-states reachable by $\epsilon$-transitions. Once no new state is
-reachable state, a fixpoint is reached. This is used for example when
-calculating the starting states of an equivalent NFA---see Line 36.
-We starts with all starting states of the $\epsilon$NFA and then look
-for all other states that can be reached by
-$\epsilon$-transition. This is what the $\epsilon$-closure, called
-\texttt{ecl}, calculates.
+\noindent where the single $\epsilon$-transition is replaced by
+three additional $a$-transitions. Please do the calculations yourself
+and verify that I did not forget any transition.
+
+So in what follows, whenever we are given an $\epsilon$NFA we will
+replace it by an equivalent NFA. The code for this is given in
+Figure~\ref{enfa}. The main workhorse in this code is a function that
+calculates a fixpoint of function (Lines 5--10). This function is used
+for ``discovering'' which states are reachable by
+$\epsilon$-transitions. Once no new state is discovered, a fixpoint is
+reached. This is used for example when calculating the starting states
+of an equivalent NFA (see Line 36): we start with all starting states
+of the $\epsilon$NFA and then look for all additional states that can
+be reached by $\epsilon$-transitions. We keep on doing this until no
+new state can be reached. This is what the $\epsilon$-closure, named
+in the code \texttt{ecl}, calculates. Similarly, an accepting state of
+the NFA is when we can reach an accepting state of the $\epsilon$NFA
+by $\epsilon$-transitions.
+
 
 \begin{figure}[p]
 \small
 \lstinputlisting[numbers=left,linebackgroundcolor=
                   {\ifodd\value{lstnumber}\color{capri!3}\fi}]
-                  {../progs/enfa.scala}
+                  {../progs/display/enfa.scala}
 
 \caption{A Scala function that translates $\epsilon$NFA into NFAs. The
   transtions of $\epsilon$NFA take as input an \texttt{Option[C]}.
@@ -459,10 +473,37 @@
   set of states. The NFA is constructed in in Lines 36--38.\label{enfa}}
 \end{figure}
 
+Also look carefully how the transitions of $\epsilon$NFAs are
+implemented. The additional possibility of performing silent
+transitions is encoded by using \texttt{Option[C]} as the type for the
+``input''. The \texttt{Some}s stand for ``propper'' transitions where
+a character is consumed; \texttt{None} stands for
+$\epsilon$-transitions. The transition functions for the two
+$\epsilon$NFAs from the beginning of this section can be defined as
 
-Now having a translation of $\epsilon$NFAs to NFAs in place, we can
-finally make headway with the problem of translating regular
-expressions into equivalent NFAs. By equivalent we mean that the NFAs
+{\small\begin{lstlisting}[language=Scala,linebackgroundcolor=
+                  {\ifodd\value{lstnumber}\color{capri!3}\fi}]
+val enfa_trans1 : (State, Option[Char]) :=> Set[State] = 
+  { case (Q0, Some('a')) => Set(Q0)
+    case (Q0, None) => Set(Q1, Q2)
+    case (Q1, Some('a')) => Set(Q1)
+    case (Q2, Some('b')) => Set(Q2) }
+
+val enfa_trans2 : (State, Option[Char]) :=> Set[State] = 
+  { case (R1, Some('b')) => Set(R3)
+    case (R1, None) => Set(R2)
+    case (R2, Some('a')) => Set(R1, R3) }
+\end{lstlisting}}
+
+\noindent
+I hope you agree now with my earlier statement that the $\epsilon$NFAs
+are just an API for NFAs.
+
+\subsubsection*{Thompson Construction}
+
+Having the translation of $\epsilon$NFAs to NFAs in place, we can
+finally return to the problem of translating regular expressions into
+equivalent NFAs. Recall that by equivalent we mean that the NFAs
 recognise the same language. Consider the simple regular expressions
 $\ZERO$, $\ONE$ and $c$. They can be translated into equivalent NFAs
 as follows:
@@ -477,27 +518,55 @@
 \begin{tikzpicture}[scale=0.7,>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
 \node[state, initial, accepting]  (Q_0)  {$\mbox{}$};
 \end{tikzpicture}\\\\
-\raisebox{2mm}{$c$} & 
+\raisebox{3mm}{$c$} & 
 \begin{tikzpicture}[scale=0.7,>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
 \node[state, initial]  (Q_0)  {$\mbox{}$};
 \node[state, accepting]  (Q_1)  [right=of Q_0] {$\mbox{}$};
 \path[->] (Q_0) edge node [below]  {$c$} (Q_1);
-\end{tikzpicture}\\\\
+\end{tikzpicture}\\
 \end{tabular}
 \end{center}
 
+\noindent
+I let you think whether the NFAs can match exactly those strings the
+regular expressions can match. To do this translation in code we need
+a way to construct states programatically...and as an additional
+constrain Scala needs to recognise these states as being distinct.
+For this I implemented in Figure~\ref{thompson1} a class
+\texttt{TState} that includes a counter and a companion object that
+increases this counter whenever a state is created.\footnote{You might
+  have to read up what \emph{companion objects} are in Scala.}
+
 \begin{figure}[p]
 \small
 \lstinputlisting[numbers=left,linebackgroundcolor=
                   {\ifodd\value{lstnumber}\color{capri!3}\fi}]
-                  {../progs/thompson.scala}
-\caption{A Scala XXX\label{thompson}}
+                  {../progs/display/thompson1.scala}
+\caption{The first part of the Thompson Construction. Lines 7--16
+  implement a way how to create states that are all
+  distinct by virtue of a counter. This counter is
+  increased in the companion object of \texttt{TState}
+  whenever a new state is created. The code in Lines 24--40
+  constructs NFAs for the simple regular expressions.
+  \label{thompson1}}
 \end{figure}
 
+\begin{figure}[p]
+\small
+\lstinputlisting[numbers=left,linebackgroundcolor=
+                  {\ifodd\value{lstnumber}\color{capri!3}\fi}]
+                  {../progs/display/thompson2.scala}
+\caption{The second part of the Thompson Construction implementing
+  the composition of NFAs according to $\cdot$, $+$ and $\_^*$.
+  The implicit class about rich partial functions
+  implements the infix operation \texttt{+++} which
+  combines an $\epsilon$NFA transition with a NFA transition
+  (both given as partial functions).\label{thompson2}}
+\end{figure}
 
-\noindent The case for the sequence regular expression $r_1
-\cdot r_2$ is as follows: We are given by recursion two
-automata representing $r_1$ and $r_2$ respectively. 
+The case for the sequence regular expression $r_1 \cdot r_2$ is as
+follows: We are given by recursion two automata representing $r_1$ and
+$r_2$ respectively.
 
 \begin{center}
 \begin{tikzpicture}[node distance=3mm,