afl-material: comparison handouts/ho03.tex

equal deleted inserted replaced

-:8178fcf377dc
+:a697421eaa04
 \documentclass{article}
 \usepackage{../style}
 \usepackage{../langs}
 \usepackage{../graphics}
-%We can even allow ``silent
-%transitions'', also called epsilon-transitions. They allow us
-%to go from one state to the next without having a character
-%consumed. We label such silent transition with the letter
-%$\epsilon$.
 \begin{document}
 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017}
 \section*{Handout 3 (Finite Automata)}
 \begin{figure}[p]
 \small
 \lstinputlisting[numbers=left,linebackgroundcolor=
 {\ifodd\value{lstnumber}\color{capri!3}\fi}]
-{../progs/dfa.scala}
+{../progs/display/dfa.scala}
 \caption{A Scala implementation of DFAs using partial functions.
 Notice some subtleties: \texttt{deltas} implements the delta-hat
-construction by lifting the transition (partial) function to lists
+construction by lifting the (partial) transition  function to lists
 of characters. Since \texttt{delta} is given as a partial function,
 it can obviously go ``wrong'' in which case the \texttt{Try} in
 \texttt{accepts} catches the error and returns \texttt{false}---that
 means the string is not accepted.  The example \texttt{delta} in
 Line 28--38 implements the DFA example shown earlier in the
 Perhaps interestingly, I do not actually use relations for my NFAs,
 but use transition functions that return sets of states.  DFAs have
 partial transition functions that return a single state; my NFAs
 return a set. I let you think about this representation for
 NFA-transitions and how it corresponds to the relations used in the
-mathematical definition of NFAs.
+mathematical definition of NFAs. An example of a transition function
+in Scala for the NFA above is
-Like in the mathematical definition, \texttt{starts} is in NFAs a set
-of states; \texttt{fins} is again a function from states to
+{\small\begin{lstlisting}[language=Scala,linebackgroundcolor=
+{\ifodd\value{lstnumber}\color{capri!3}\fi}]
+val nfa_delta : (State, Char) :=> Set[State] =
+{ case (Q0, 'a') => Set(Q1, Q2)
+case (Q0, 'b') => Set(Q0)
+case (Q1, 'a') => Set(Q1, Q2)
+case (Q1, 'b') => Set(Q0, Q1) }
+\end{lstlisting}}
+\noindent Like in the mathematical definition, \texttt{starts} is in
+NFAs a set of states; \texttt{fins} is again a function from states to
 booleans. The \texttt{next} function calculates the set of next states
 reachable from a single state \texttt{q} by a character~\texttt{c}. In
 case there is no such state---the partial transition function is
 undefined---the empty set is returned (see function
 \texttt{applyOrElse} in Lines 9 and 10). The function \texttt{nexts}
 \begin{figure}[p]
 \small
 \lstinputlisting[numbers=left,linebackgroundcolor=
 {\ifodd\value{lstnumber}\color{capri!3}\fi}]
-{../progs/nfa.scala}
+{../progs/display/nfa.scala}
 \caption{A Scala implementation of NFAs using partial functions.
 Notice that the function \texttt{accepts} implements the
 acceptance of a string in a breath-first search fashion. This can be a costly
 way of deciding whether a string is accepted or not in applications that need to handle
 large NFAs and large inputs.\label{nfa}}
 Look very careful at the \texttt{accepts} and \texttt{deltas}
 functions in NFAs and remember that when accepting a string by a NFA
 we might have to explore all possible transitions (recall which state
 to go to is not unique anymore with NFAs\ldots{}we need to explore
 potentially all next states). The implementation achieves this
-exploration in a \emph{breadth-first search} manner. This is fine for
+exploration through a \emph{breadth-first search}. This is fine for
 small NFAs, but can lead to real memory problems when the NFAs are
 bigger and larger strings need to be processed. As result, some
 regular expression matching engines resort to a \emph{depth-first
 search} with \emph{backtracking} in unsuccessful cases. In our
 implementation we can implement a depth-first version of
 def accepts2(s: List[C]) : Boolean =
 starts.exists(search(_, s))
 \end{lstlisting}}
 \noindent
-This depth-first way of exploration seems to work efficiently in many
+This depth-first way of exploration seems to work quite efficiently in
-examples and is much less of a strain on memory. The problem is that
+many examples and is much less of a strain on memory. The problem is
-the backtracking can get ``catastrophic'' in some examples---remember
+that the backtracking can get ``catastrophic'' in some
-the catastrophic backtracking from earlier lectures. This depth-first
+examples---remember the catastrophic backtracking from earlier
-search with backtracking is the reason for the abysmal performance of
+lectures. This depth-first search with backtracking is the reason for
-some regular expression matchings in Java, Ruby and Python. I like to
+the abysmal performance of some regular expression matchings in Java,
-show you this next.
+Ruby and Python. I like to show you this in the next two sections.
-\subsubsection*{Thompson Construction}
+\subsubsection*{Epsilon NFAs}
 In order to get an idea what calculations are performed by Java \&
 friends, we need a method for transforming a regular expression into
 an automaton. This automaton should accept exactly those strings that
 are accepted by the regular expression.  The simplest and most
 well-known method for this is called \emph{Thompson Construction},
 after the Turing Award winner Ken Thompson. This method is by
-recursion over regular expressions and uses the non-determinism in
+recursion over regular expressions and depends on the non-determinism
-NFAs. You will see shortly why this construction works well with NFAs,
+in NFAs described in the earlier section. You will see shortly why
-but is not so straightforward with DFAs. Unfortunately we are still
+this construction works well with NFAs, but is not so straightforward
-one step away from our intended target though---because this
+with DFAs.
-construction uses a version of NFAs that allows ``silent
-transitions''. The idea behind silent transitions is that they
+Unfortunately we are still one step away from our intended target
-allow us to go from one state to the next without having to consume a
+though---because this construction uses a version of NFAs that allows
-character. We label such silent transition with the letter
+``silent transitions''. The idea behind silent transitions is that
-$\epsilon$ and call the automata $\epsilon$NFAs. Two
+they allow us to go from one state to the next without having to
-typical examples of $\epsilon$NFAs are
+consume a character. We label such silent transition with the letter
+$\epsilon$ and call the automata $\epsilon$NFAs. Two typical examples
+of $\epsilon$NFAs are:
 \begin{center}
 \begin{tabular}[t]{c@{\hspace{9mm}}c}
 \begin{tikzpicture}[>=stealth',very thick,
 \end{tikzpicture}}
 \end{tabular}
 \end{center}
 \noindent
-Consider the $\epsilon$NFA on the left-hand side: an
+Consider the $\epsilon$NFA on the left-hand side: the
-$\epsilon$-transition means you do not have to ``consume'' any part of
+$\epsilon$-transitions mean you do not have to ``consume'' any part of
-the input string, but ``silently'' change to a different state. For
+the input string, but ``silently'' change to a different state. In
-example, if you are in the starting state $Q_0$, you can silently move
+this example, if you are in the starting state $Q_0$, you can silently
-either to $Q_1$ or $Q_2$. In this example you can see that once you
+move either to $Q_1$ or $Q_2$. You can see that once you are in $Q_1$,
-are in $Q_1$, respectively $Q_2$, you cannot ``go back'' to the other
+respectively $Q_2$, you cannot ``go back'' to the other states. So it
-states.
+seems allowing $\epsilon$-transitions is a rather substancial
+extension to NFAs. On first appearances, $\epsilon$-transitions might
-On first appearances, $\epsilon$-transitions might look rather
+even look rather strange, or even silly. To start with, silent
-strange, or even silly. To start with, silent transitions make the
+transitions make the decision whether a string is accepted by an
-decision whether a string is accepted by an automaton even harder:
+automaton even harder: with $\epsilon$NFAs we have to look whether we
-with $\epsilon$NFAs we have to look whether we can do first some
+can do first some $\epsilon$-transitions and then do a
-$\epsilon$-transitions and then do a ``proper''-transition; and after
+``proper''-transition; and after any ``proper''-transition we again
-any ``proper''-transition we again have to check whether we can do
+have to check whether we can do again some silent transitions. Even
-again some silent transitions. Even worse, if there is a silent
+worse, if there is a silent transition pointing back to the same
-transition pointing back to the same state, then we have to be careful
+state, then we have to be careful our decision procedure for strings
-our decision procedure for strings does not loop (remember the
+does not loop (remember the depth-first search for exploring all
-depth-first search for exploring all states).
+states).
 The obvious question is: Do we get anything in return for this hassle
 with silent transitions?  Well, we still have to work for it\ldots
 unfortunately.  If we were to follow the many textbooks on the
 subject, we would now start with defining what $\epsilon$NFAs
 are---that would require extending the transition relation of
 NFAs. Next, show that the $\epsilon$NFAs are equivalent to NFAs and so
 on. Once we have done all this on paper, we would need to implement
-$\epsilon$NFAs. Lets try to take a shortcut---we are not really
+$\epsilon$NFAs. Lets try to take a shortcut instead. We are not really
 interested in $\epsilon$NFAs; they are only a convenient tool for
-translating regular expressions into automata. We are not going to
+translating regular expressions into automata. So we are not going to
-implementing them explicitly, but translate them directly into NFAs
+implementing them explicitly, but translate them immediately into NFAs
-(in a sense $\epsilon$NFAs are just a convenient API for lazy people).
+(in a sense $\epsilon$NFAs are just a convenient API for lazy people ;o).
-How does the translation work: well we have to find all transitions of
+How does the translation work? Well we have to find all transitions of
 the form
 \[
 q\stackrel{\epsilon}{\longrightarrow}\ldots\stackrel{\epsilon}{\longrightarrow}
 \;\stackrel{a}{\longrightarrow}\;
 \stackrel{\epsilon}{\longrightarrow}\ldots\stackrel{\epsilon}{\longrightarrow} q'
 \]
 \noindent and replace them with $q \stackrel{a}{\longrightarrow}
-q'$. Doing this to the $\epsilon$NFA on the left-hand side above gives
+q'$. Doing this to the $\epsilon$NFA on the right-hand side above gives
 the NFA
 \begin{center}
 \begin{tikzpicture}[>=stealth',very thick,
 every state/.style={minimum size=0pt,draw=blue!50,very thick,fill=blue!20},]
 \path[->] (r_1) edge [loop below] node  {$a$} ();
 \path[->] (r_1) edge [bend right] node [below]  {$a$} (r_3);
 \end{tikzpicture}
 \end{center}
-\noindent where the single the $\epsilon$-transition is replaced by
+\noindent where the single $\epsilon$-transition is replaced by
-three more $a$-transitions. So whenever we are given $\epsilon$NFA we
+three additional $a$-transitions. Please do the calculations yourself
-will, similarly, replace it by an equivalent NFA. The code for this is
+and verify that I did not forget any transition.
-given in Figure~\ref{enfa}. At the core is a function that calculates
-a fixpoint of function (Lines 5--10). This is used for ``discovering''
+So in what follows, whenever we are given an $\epsilon$NFA we will
-states reachable by $\epsilon$-transitions. Once no new state is
+replace it by an equivalent NFA. The code for this is given in
-reachable state, a fixpoint is reached. This is used for example when
+Figure~\ref{enfa}. The main workhorse in this code is a function that
-calculating the starting states of an equivalent NFA---see Line 36.
+calculates a fixpoint of function (Lines 5--10). This function is used
-We starts with all starting states of the $\epsilon$NFA and then look
+for ``discovering'' which states are reachable by
-for all other states that can be reached by
+$\epsilon$-transitions. Once no new state is discovered, a fixpoint is
-$\epsilon$-transition. This is what the $\epsilon$-closure, called
+reached. This is used for example when calculating the starting states
-\texttt{ecl}, calculates.
+of an equivalent NFA (see Line 36): we start with all starting states
+of the $\epsilon$NFA and then look for all additional states that can
+be reached by $\epsilon$-transitions. We keep on doing this until no
+new state can be reached. This is what the $\epsilon$-closure, named
+in the code \texttt{ecl}, calculates. Similarly, an accepting state of
+the NFA is when we can reach an accepting state of the $\epsilon$NFA
+by $\epsilon$-transitions.
 \begin{figure}[p]
 \small
 \lstinputlisting[numbers=left,linebackgroundcolor=
 {\ifodd\value{lstnumber}\color{capri!3}\fi}]
-{../progs/enfa.scala}
+{../progs/display/enfa.scala}
 \caption{A Scala function that translates $\epsilon$NFA into NFAs. The
 transtions of $\epsilon$NFA take as input an \texttt{Option[C]}.
 \texttt{None} stands for an $\epsilon$-transition; \texttt{Some(c)}
 for a ``proper'' transition. The functions in Lines 18--26 calculate
 all states reachable by one or more $\epsilon$-transition for a given
 set of states. The NFA is constructed in in Lines 36--38.\label{enfa}}
 \end{figure}
+Also look carefully how the transitions of $\epsilon$NFAs are
-Now having a translation of $\epsilon$NFAs to NFAs in place, we can
+implemented. The additional possibility of performing silent
-finally make headway with the problem of translating regular
+transitions is encoded by using \texttt{Option[C]} as the type for the
-expressions into equivalent NFAs. By equivalent we mean that the NFAs
+``input''. The \texttt{Some}s stand for ``propper'' transitions where
+a character is consumed; \texttt{None} stands for
+$\epsilon$-transitions. The transition functions for the two
+$\epsilon$NFAs from the beginning of this section can be defined as
+{\small\begin{lstlisting}[language=Scala,linebackgroundcolor=
+{\ifodd\value{lstnumber}\color{capri!3}\fi}]
+val enfa_trans1 : (State, Option[Char]) :=> Set[State] =
+{ case (Q0, Some('a')) => Set(Q0)
+case (Q0, None) => Set(Q1, Q2)
+case (Q1, Some('a')) => Set(Q1)
+case (Q2, Some('b')) => Set(Q2) }
+val enfa_trans2 : (State, Option[Char]) :=> Set[State] =
+{ case (R1, Some('b')) => Set(R3)
+case (R1, None) => Set(R2)
+case (R2, Some('a')) => Set(R1, R3) }
+\end{lstlisting}}
+\noindent
+I hope you agree now with my earlier statement that the $\epsilon$NFAs
+are just an API for NFAs.
+\subsubsection*{Thompson Construction}
+Having the translation of $\epsilon$NFAs to NFAs in place, we can
+finally return to the problem of translating regular expressions into
+equivalent NFAs. Recall that by equivalent we mean that the NFAs
 recognise the same language. Consider the simple regular expressions
 $\ZERO$, $\ONE$ and $c$. They can be translated into equivalent NFAs
 as follows:
 \begin{center}
 \end{tikzpicture}\\\\
 \raisebox{1mm}{$\ONE$} &
 \begin{tikzpicture}[scale=0.7,>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
 \node[state, initial, accepting]  (Q_0)  {$\mbox{}$};
 \end{tikzpicture}\\\\
-\raisebox{2mm}{$c$} &
+\raisebox{3mm}{$c$} &
 \begin{tikzpicture}[scale=0.7,>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
 \node[state, initial]  (Q_0)  {$\mbox{}$};
 \node[state, accepting]  (Q_1)  [right=of Q_0] {$\mbox{}$};
 \path[->] (Q_0) edge node [below]  {$c$} (Q_1);
-\end{tikzpicture}\\\\
+\end{tikzpicture}\\
 \end{tabular}
 \end{center}
+\noindent
+I let you think whether the NFAs can match exactly those strings the
+regular expressions can match. To do this translation in code we need
+a way to construct states programatically...and as an additional
+constrain Scala needs to recognise these states as being distinct.
+For this I implemented in Figure~\ref{thompson1} a class
+\texttt{TState} that includes a counter and a companion object that
+increases this counter whenever a state is created.\footnote{You might
+have to read up what \emph{companion objects} are in Scala.}
 \begin{figure}[p]
 \small
 \lstinputlisting[numbers=left,linebackgroundcolor=
 {\ifodd\value{lstnumber}\color{capri!3}\fi}]
-{../progs/thompson.scala}
+{../progs/display/thompson1.scala}
-\caption{A Scala XXX\label{thompson}}
+\caption{The first part of the Thompson Construction. Lines 7--16
+implement a way how to create states that are all
+distinct by virtue of a counter. This counter is
+increased in the companion object of \texttt{TState}
+whenever a new state is created. The code in Lines 24--40
+constructs NFAs for the simple regular expressions.
+\label{thompson1}}
 \end{figure}
+\begin{figure}[p]
-\noindent The case for the sequence regular expression $r_1
+\small
-\cdot r_2$ is as follows: We are given by recursion two
+\lstinputlisting[numbers=left,linebackgroundcolor=
-automata representing $r_1$ and $r_2$ respectively.
+{\ifodd\value{lstnumber}\color{capri!3}\fi}]
+{../progs/display/thompson2.scala}
+\caption{The second part of the Thompson Construction implementing
+the composition of NFAs according to $\cdot$, $+$ and $\_^*$.
+The implicit class about rich partial functions
+implements the infix operation \texttt{+++} which
+combines an $\epsilon$NFA transition with a NFA transition
+(both given as partial functions).\label{thompson2}}
+\end{figure}
+The case for the sequence regular expression $r_1 \cdot r_2$ is as
+follows: We are given by recursion two automata representing $r_1$ and
+$r_2$ respectively.
 \begin{center}
 \begin{tikzpicture}[node distance=3mm,
 >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
 \node[state, initial]  (Q_0)  {$\mbox{}$};

changeset 487	a697421eaa04
parent 485	bd6d999ab7b6
child 488	598741d39d21