afl-material: comparison handouts/ho03.tex

equal deleted inserted replaced

-:a697421eaa04
+:598741d39d21
 \begin{document}
 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017}
 \section*{Handout 3 (Finite Automata)}
 Every formal language and compiler course I know of bombards you first
 with automata and then to a much, much smaller extend with regular
 expressions. As you can see, this course is turned upside down:
 regular expressions come first. The reason is that regular expressions
 are easier to reason about and the notion of derivatives, although
 already quite old, only became more widely known rather
-recently. Still let us in this lecture have a closer look at automata
+recently. Still, let us in this lecture have a closer look at automata
 and their relation to regular expressions. This will help us with
 understanding why the regular expression matchers in Python, Ruby and
 Java are so slow with certain regular expressions. On the way we will
-also see what are the limitations of regular expressions.
+also see what are the limitations of regular
+expressions. Unfortunately, they cannot be used for \emph{everything}.
 \subsection*{Deterministic Finite Automata}
 Lets start\ldots the central definition is:\medskip
 \item $Q_0 \in Qs$ is the start state,
 \item $F \subseteq Qs$ are the accepting states, and
 \item $\delta$ is the transition function.
 \end{itemize}
-\noindent The transition function determines how to ``transition''
+\noindent I am sure you have seen this defininition already
-from one state to the next state with respect to a character. We have
+before. The transition function determines how to ``transition'' from
-the assumption that these transition functions do not need to be
+one state to the next state with respect to a character. We have the
-defined everywhere: so it can be the case that given a character there
+assumption that these transition functions do not need to be defined
-is no next state, in which case we need to raise a kind of ``failure
+everywhere: so it can be the case that given a character there is no
+next state, in which case we need to raise a kind of ``failure
 exception''.  That means actually we have \emph{partial} functions as
 transitions---see the Scala implementation of DFAs later on.  A
 typical example of a DFA is
 \begin{center}
 \[
 \widehat{\delta}(Q_0, s) \in F
 \]
-\noindent I let you think about a definition that describes
+\noindent I let you think about a definition that describes the set of
-the set of all strings accepted by an automaton.
+all strings accepted by a determinsitic finite automaton.
 \begin{figure}[p]
 \small
 \lstinputlisting[numbers=left,linebackgroundcolor=
 {\ifodd\value{lstnumber}\color{capri!3}\fi}]
 {../progs/display/dfa.scala}
 \caption{A Scala implementation of DFAs using partial functions.
-Notice some subtleties: \texttt{deltas} implements the delta-hat
+Note some subtleties: \texttt{deltas} implements the delta-hat
 construction by lifting the (partial) transition  function to lists
 of characters. Since \texttt{delta} is given as a partial function,
 it can obviously go ``wrong'' in which case the \texttt{Try} in
 \texttt{accepts} catches the error and returns \texttt{false}---that
 means the string is not accepted.  The example \texttt{delta} in
 This NFA happens to have only one starting state, but in general there
 could be more.  Notice that in state $Q_0$ we might go to state $Q_1$
 \emph{or} to state $Q_2$ when receiving an $a$. Similarly in state
 $Q_1$ and receiving an $a$, we can stay in state $Q_1$ \emph{or} go to
 $Q_2$. This kind of choice is not allowed with DFAs. The downside of
-this choice is that when it comes to deciding whether a string is
+this choice in NFAs is that when it comes to deciding whether a string is
 accepted by a NFA we potentially have to explore all possibilities. I
-let you think which kind of strings the above NFA accepts.
+let you think which strings the above NFA accepts.
 There are a number of additional points you should note about
 NFAs. Every DFA is a NFA, but not vice versa. The $\rho$ in NFAs is a
 transition \emph{relation} (DFAs have a transition function). The
 My implementation of NFAs in Scala is shown in Figure~\ref{nfa}.
 Perhaps interestingly, I do not actually use relations for my NFAs,
 but use transition functions that return sets of states.  DFAs have
 partial transition functions that return a single state; my NFAs
-return a set. I let you think about this representation for
+return a set of states. I let you think about this representation for
 NFA-transitions and how it corresponds to the relations used in the
 mathematical definition of NFAs. An example of a transition function
-in Scala for the NFA above is
+in Scala for the NFA shown above is
 {\small\begin{lstlisting}[language=Scala,linebackgroundcolor=
 {\ifodd\value{lstnumber}\color{capri!3}\fi}]
 val nfa_delta : (State, Char) :=> Set[State] =
 { case (Q0, 'a') => Set(Q1, Q2)
 an automaton. This automaton should accept exactly those strings that
 are accepted by the regular expression.  The simplest and most
 well-known method for this is called \emph{Thompson Construction},
 after the Turing Award winner Ken Thompson. This method is by
 recursion over regular expressions and depends on the non-determinism
-in NFAs described in the earlier section. You will see shortly why
+in NFAs described in the previous section. You will see shortly why
 this construction works well with NFAs, but is not so straightforward
 with DFAs.
 Unfortunately we are still one step away from our intended target
 though---because this construction uses a version of NFAs that allows
 The obvious question is: Do we get anything in return for this hassle
 with silent transitions?  Well, we still have to work for it\ldots
 unfortunately.  If we were to follow the many textbooks on the
 subject, we would now start with defining what $\epsilon$NFAs
 are---that would require extending the transition relation of
-NFAs. Next, show that the $\epsilon$NFAs are equivalent to NFAs and so
+NFAs. Next, we woudl show that the $\epsilon$NFAs are equivalent to
-on. Once we have done all this on paper, we would need to implement
+NFAs and so on. Once we have done all this on paper, we would need to
-$\epsilon$NFAs. Lets try to take a shortcut instead. We are not really
+implement $\epsilon$NFAs. Lets try to take a shortcut instead. We are
-interested in $\epsilon$NFAs; they are only a convenient tool for
+not really interested in $\epsilon$NFAs; they are only a convenient
-translating regular expressions into automata. So we are not going to
+tool for translating regular expressions into automata. So we are not
-implementing them explicitly, but translate them immediately into NFAs
+going to implementing them explicitly, but translate them immediately
-(in a sense $\epsilon$NFAs are just a convenient API for lazy people ;o).
+into NFAs (in a sense $\epsilon$NFAs are just a convenient API for
-How does the translation work? Well we have to find all transitions of
+lazy people ;o).  How does the translation work? Well we have to find
-the form
+all transitions of the form
 \[
 q\stackrel{\epsilon}{\longrightarrow}\ldots\stackrel{\epsilon}{\longrightarrow}
 \;\stackrel{a}{\longrightarrow}\;
 \stackrel{\epsilon}{\longrightarrow}\ldots\stackrel{\epsilon}{\longrightarrow} q'
 \noindent where the single $\epsilon$-transition is replaced by
 three additional $a$-transitions. Please do the calculations yourself
 and verify that I did not forget any transition.
 So in what follows, whenever we are given an $\epsilon$NFA we will
-replace it by an equivalent NFA. The code for this is given in
+replace it by an equivalent NFA. The Scala code for this translation
-Figure~\ref{enfa}. The main workhorse in this code is a function that
+is given in Figure~\ref{enfa}. The main workhorse in this code is a
-calculates a fixpoint of function (Lines 5--10). This function is used
+function that calculates a fixpoint of function (Lines 5--10). This
-for ``discovering'' which states are reachable by
+function is used for ``discovering'' which states are reachable by
 $\epsilon$-transitions. Once no new state is discovered, a fixpoint is
 reached. This is used for example when calculating the starting states
 of an equivalent NFA (see Line 36): we start with all starting states
 of the $\epsilon$NFA and then look for all additional states that can
 be reached by $\epsilon$-transitions. We keep on doing this until no
 \lstinputlisting[numbers=left,linebackgroundcolor=
 {\ifodd\value{lstnumber}\color{capri!3}\fi}]
 {../progs/display/enfa.scala}
 \caption{A Scala function that translates $\epsilon$NFA into NFAs. The
-transtions of $\epsilon$NFA take as input an \texttt{Option[C]}.
+transtion function of $\epsilon$NFA takes as input an \texttt{Option[C]}.
 \texttt{None} stands for an $\epsilon$-transition; \texttt{Some(c)}
-for a ``proper'' transition. The functions in Lines 18--26 calculate
+for a ``proper'' transition consuming a character. The functions in
+Lines 18--26 calculate
 all states reachable by one or more $\epsilon$-transition for a given
-set of states. The NFA is constructed in in Lines 36--38.\label{enfa}}
+set of states. The NFA is constructed in Lines 36--38.\label{enfa}}
 \end{figure}
 Also look carefully how the transitions of $\epsilon$NFAs are
 implemented. The additional possibility of performing silent
 transitions is encoded by using \texttt{Option[C]} as the type for the
 equivalent NFAs. Recall that by equivalent we mean that the NFAs
 recognise the same language. Consider the simple regular expressions
 $\ZERO$, $\ONE$ and $c$. They can be translated into equivalent NFAs
 as follows:
-\begin{center}
+\begin{equation}\mbox{
 \begin{tabular}[t]{l@{\hspace{10mm}}l}
 \raisebox{1mm}{$\ZERO$} &
 \begin{tikzpicture}[scale=0.7,>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
 \node[state, initial]  (Q_0)  {$\mbox{}$};
 \end{tikzpicture}\\\\
 \begin{tikzpicture}[scale=0.7,>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
 \node[state, initial]  (Q_0)  {$\mbox{}$};
 \node[state, accepting]  (Q_1)  [right=of Q_0] {$\mbox{}$};
 \path[->] (Q_0) edge node [below]  {$c$} (Q_1);
 \end{tikzpicture}\\
-\end{tabular}
+\end{tabular}}\label{simplecases}
-\end{center}
+\end{equation}
 \noindent
 I let you think whether the NFAs can match exactly those strings the
 regular expressions can match. To do this translation in code we need
 a way to construct states programatically...and as an additional
-constrain Scala needs to recognise these states as being distinct.
+constrain Scala needs to recognise that these states are being distinct.
 For this I implemented in Figure~\ref{thompson1} a class
 \texttt{TState} that includes a counter and a companion object that
-increases this counter whenever a state is created.\footnote{You might
+increases this counter whenever a new state is created.\footnote{You might
-have to read up what \emph{companion objects} are in Scala.}
+have to read up what \emph{companion objects} do in Scala.}
 \begin{figure}[p]
 \small
 \lstinputlisting[numbers=left,linebackgroundcolor=
 {\ifodd\value{lstnumber}\color{capri!3}\fi}]
 {../progs/display/thompson1.scala}
 \caption{The first part of the Thompson Construction. Lines 7--16
-implement a way how to create states that are all
+implement a way of how to create new states that are all
 distinct by virtue of a counter. This counter is
 increased in the companion object of \texttt{TState}
 whenever a new state is created. The code in Lines 24--40
-constructs NFAs for the simple regular expressions.
+constructs NFAs for the simple regular expressions $\ZERO$, $\ONE$ and $c$.
+Compare the pictures given in \eqref{simplecases}.
 \label{thompson1}}
 \end{figure}
 \begin{figure}[p]
 \small
 implements the infix operation \texttt{+++} which
 combines an $\epsilon$NFA transition with a NFA transition
 (both given as partial functions).\label{thompson2}}
 \end{figure}
-The case for the sequence regular expression $r_1 \cdot r_2$ is as
+The case for the sequence regular expression $r_1 \cdot r_2$ is a bit more
-follows: We are given by recursion two automata representing $r_1$ and
+complicated: We are given by recursion two NFAs representing the regular
-$r_2$ respectively.
+expressions $r_1$ and $r_2$ respectively.
 \begin{center}
 \begin{tikzpicture}[node distance=3mm,
->=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
+>=stealth',very thick,
+every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
 \node[state, initial]  (Q_0)  {$\mbox{}$};
-\node (r_1)  [right=of Q_0] {$\ldots$};
+\node[state, initial]  (Q_01) [below=1mm of Q_0] {$\mbox{}$};
-\node[state, accepting]  (t_1)  [right=of r_1] {$\mbox{}$};
+\node[state, initial]  (Q_02) [above=1mm of Q_0] {$\mbox{}$};
-\node[state, accepting]  (t_2)  [above=of t_1] {$\mbox{}$};
+\node (R_1)  [right=of Q_0] {$\ldots$};
-\node[state, accepting]  (t_3)  [below=of t_1] {$\mbox{}$};
+\node[state, accepting]  (T_1)  [right=of R_1] {$\mbox{}$};
-\node[state, initial]  (a_0)  [right=2.5cm of t_1] {$\mbox{}$};
+\node[state, accepting]  (T_2)  [above=of T_1] {$\mbox{}$};
-\node (b_1)  [right=of a_0] {$\ldots$};
+\node[state, accepting]  (T_3)  [below=of T_1] {$\mbox{}$};
+\node (A_0)  [right=2.5cm of T_1] {$\mbox{}$};
+\node[state, initial]  (A_01)  [above=1mm of A_0] {$\mbox{}$};
+\node[state, initial]  (A_02)  [below=1mm of A_0] {$\mbox{}$};
+\node (b_1)  [right=of A_0] {$\ldots$};
 \node[state, accepting]  (c_1)  [right=of b_1] {$\mbox{}$};
 \node[state, accepting]  (c_2)  [above=of c_1] {$\mbox{}$};
 \node[state, accepting]  (c_3)  [below=of c_1] {$\mbox{}$};
 \begin{pgfonlayer}{background}
-\node (1) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (Q_0) (r_1) (t_1) (t_2) (t_3)] {};
+\node (1) [rounded corners, inner sep=1mm, thick,
-\node (2) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (a_0) (b_1) (c_1) (c_2) (c_3)] {};
+draw=black!60, fill=black!20, fit= (Q_0) (R_1) (T_1) (T_2) (T_3)] {};
+\node (2) [rounded corners, inner sep=1mm, thick,
+draw=black!60, fill=black!20, fit= (A_0) (b_1) (c_1) (c_2) (c_3)] {};
 \node [yshift=2mm] at (1.north) {$r_1$};
 \node [yshift=2mm] at (2.north) {$r_2$};
 \end{pgfonlayer}
 \end{tikzpicture}
 \end{center}
-\noindent The first automaton has some accepting states. We
+\noindent The first NFA has some accepting states and the second some
-obtain an automaton for $r_1\cdot r_2$ by connecting these
+starting states. We obtain $\epsilon$NFA for $r_1\cdot r_2$ by
-accepting states with $\epsilon$-transitions to the starting
+connecting these accepting states with $\epsilon$-transitions to the
-state of the second automaton. By doing so we make them
+starting states of the second automaton. By doing so we make them
 non-accepting like so:
 \begin{center}
 \begin{tikzpicture}[node distance=3mm,
->=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
+>=stealth',very thick,
+every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
 \node[state, initial]  (Q_0)  {$\mbox{}$};
+\node[state, initial]  (Q_01) [below=1mm of Q_0] {$\mbox{}$};
+\node[state, initial]  (Q_02) [above=1mm of Q_0] {$\mbox{}$};
 \node (r_1)  [right=of Q_0] {$\ldots$};
 \node[state]  (t_1)  [right=of r_1] {$\mbox{}$};
 \node[state]  (t_2)  [above=of t_1] {$\mbox{}$};
 \node[state]  (t_3)  [below=of t_1] {$\mbox{}$};
-\node[state]  (a_0)  [right=2.5cm of t_1] {$\mbox{}$};
-\node (b_1)  [right=of a_0] {$\ldots$};
+\node  (A_0)  [right=2.5cm of t_1] {$\mbox{}$};
+\node[state]  (A_01)  [above=1mm of A_0] {$\mbox{}$};
+\node[state]  (A_02)  [below=1mm of A_0] {$\mbox{}$};
+\node (b_1)  [right=of A_0] {$\ldots$};
 \node[state, accepting]  (c_1)  [right=of b_1] {$\mbox{}$};
 \node[state, accepting]  (c_2)  [above=of c_1] {$\mbox{}$};
 \node[state, accepting]  (c_3)  [below=of c_1] {$\mbox{}$};
-\path[->] (t_1) edge node [above, pos=0.3]  {$\epsilon$} (a_0);
+\path[->] (t_1) edge (A_01);
-\path[->] (t_2) edge node [above]  {$\epsilon$} (a_0);
+\path[->] (t_2) edge node [above]  {$\epsilon$} (A_01);
-\path[->] (t_3) edge node [below]  {$\epsilon$} (a_0);
+\path[->] (t_3) edge (A_01);
+\path[->] (t_1) edge (A_02);
+\path[->] (t_2) edge (A_02);
+\path[->] (t_3) edge node [below]  {$\epsilon$} (A_02);
 \begin{pgfonlayer}{background}
-\node (3) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (Q_0) (c_1) (c_2) (c_3)] {};
+\node (3) [rounded corners, inner sep=1mm, thick,
+draw=black!60, fill=black!20, fit= (Q_0) (c_1) (c_2) (c_3)] {};
 \node [yshift=2mm] at (3.north) {$r_1\cdot r_2$};
 \end{pgfonlayer}
 \end{tikzpicture}
 \end{center}
 r_2$ is slightly different: We are given by recursion two
 automata representing $r_1$ and $r_2$ respectively.
 \begin{center}
 \begin{tikzpicture}[node distance=3mm,
->=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
+>=stealth',very thick,
+every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
 \node at (0,0)  (1)  {$\mbox{}$};
 \node[state, initial]  (2)  [above right=16mm of 1] {$\mbox{}$};
 \node[state, initial]  (3)  [below right=16mm of 1] {$\mbox{}$};
 \node (a)  [right=of 2] {$\ldots$};
 \noindent and connect its accepting states to a new starting
 state via $\epsilon$-transitions. This new starting state is
 also an accepting state, because $r^*$ can recognise the
 empty string. This gives the following automaton for $r^*$:
 \begin{center}
 \begin{tikzpicture}[node distance=3mm,
 >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
 \node at (0,0) [state, initial,accepting]  (1)  {$\mbox{}$};
 \node[state]  (2)  [right=16mm of 1] {$\mbox{}$};
 \node (a)  [right=of 2] {$\ldots$};
 \node [yshift=3mm] at (2.north) {$r^*$};
 \end{pgfonlayer}
 \end{tikzpicture}
 \end{center}
+{\small\begin{lstlisting}[language=Scala,linebackgroundcolor=
+{\ifodd\value{lstnumber}\color{capri!3}\fi}]
+def thompson (r: Rexp) : NFAt = r match {
+case ZERO => NFA_ZERO()
+case ONE => NFA_ONE()
+case CHAR(c) => NFA_CHAR(c)
+case ALT(r1, r2) => NFA_ALT(thompson(r1), thompson(r2))
+case SEQ(r1, r2) => NFA_SEQ(thompson(r1), thompson(r2))
+case STAR(r1) => NFA_STAR(thompson(r1))
+}
+\end{lstlisting}}
+\begin{center}
+\begin{tabular}{@{\hspace{-1mm}}c@{\hspace{1mm}}c@{}}
+\begin{tikzpicture}
+\begin{axis}[
+title={Graph: $\texttt{(a*)*\,b}$ and strings
+$\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
+xlabel={$n$},
+x label style={at={(1.05,0.0)}},
+ylabel={time in secs},
+enlargelimits=false,
+xtick={0,5,...,30},
+xmax=33,
+ymax=35,
+ytick={0,5,...,30},
+scaled ticks=false,
+axis lines=left,
+width=5.5cm,
+height=4.5cm,
+legend entries={Python, Java},
+legend pos=outer north east,
+legend cell align=left]
+\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
+\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
+% depth-first search in NFAs
+\addplot[red,mark=*, mark options={fill=white}] table {
+1 0.00605
+2 0.03086
+3 0.11994
+4 0.45389
+5 2.06192
+6 8.04894
+7 32.63549
+};
+\end{axis}
+\end{tikzpicture}
+&
+\begin{tikzpicture}
+\begin{axis}[
+title={Graph: $\texttt{a?\{n\}\,a{\{n\}}}$ and strings
+$\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
+xlabel={$n$},
+x label style={at={(1.05,0.0)}},
+ylabel={time in secs},
+enlargelimits=false,
+xtick={0,5,...,30},
+xmax=33,
+ymax=35,
+ytick={0,5,...,30},
+scaled ticks=false,
+axis lines=left,
+width=5.5cm,
+height=4.5cm,
+legend entries={Python,Ruby},
+legend pos=south east,
+legend cell align=left]
+\addplot[blue,mark=*, mark options={fill=white}] table {re-python.data};
+\addplot[brown,mark=triangle*, mark options={fill=white}] table {re-ruby.data};
+% breath-first search in NFAs
+\addplot[red,mark=*, mark options={fill=white}] table {
+1  0.00741
+2  0.02657
+3  0.08317
+4  0.22133
+5  0.54463
+6  1.42062
+7  4.04926
+8  12.70395
+9  39.21555
+10  112.05466
+};
+\end{axis}
+\end{tikzpicture}
+\end{tabular}
+\end{center}
 \noindent This construction of a NFA from a regular expression
 was invented by Ken Thompson in 1968.
 \subsubsection*{Subset Construction}
-What is interesting is that for every NFA we can find a DFA
+Remember that we did not bother with defining and implementing
-which recognises the same language. This can, for example, be
+$\epsilon$NFA; we immediately translated them into equivalent
-done by the \emph{subset construction}. Consider again the NFA
+NFAs. Equivalent in the sense of accepting the same language (though
-below on the left, consisting of nodes labelled $0$, $1$ and $2$.
+we only claimed this and did not prove it rigorously). Remember also
+that NFAs have a non-deterministic transitions, given as a relation.
+This non-determinism makes it harder to decide when a string is
+accepted or not; such a decision is rather straightforward with DFAs
+(remember their transition function).
+What is interesting is that for every NFA we can find a DFA that also
+recognises the same language. This might sound like a bit paradoxical,
+but I litke to show you this next. There are a number of ways of
+transforming a NFA into an equivalent DFA, but the most famous is
+\emph{subset construction}. Consider again the NFA below on the left,
+consisting of nodes labelled, say, with $0$, $1$ and $2$.
 \begin{center}
 \begin{tabular}{c@{\hspace{10mm}}c}
 \begin{tikzpicture}[scale=0.7,>=stealth',very thick,
 every state/.style={minimum size=0pt,

changeset 488	598741d39d21
parent 487	a697421eaa04
child 489	e28d7a327870