afl-material: comparison handouts/ho03.tex


-:74149519e436
+:faba5360372c
 Java are so slow with certain regular expressions.
 \subsection*{Deterministic Finite Automata}
-Their definition is as follows:\medskip
+The central definition is:\medskip
 \noindent
 A \emph{deterministic finite automaton} (DFA), say $A$, is
-given by a four-tuple written $A(Qs, Q_0, F, \delta)$ where
+given by a five-tuple written $A(\Sigma, Qs, Q_0, F, \delta)$ where
 \begin{itemize}
+\item $\Sigma$ is an alphabet,
 \item $Qs$ is a finite set of states,
 \item $Q_0 \in Qs$ is the start state,
 \item $F \subseteq Qs$ are the accepting states, and
 \item $\delta$ is the transition function.
 \end{itemize}
 \noindent The transition function determines how to ``transition''
 from one state to the next state with respect to a character. We have
 the assumption that these transition functions do not need to be
 defined everywhere: so it can be the case that given a character there
 is no next state, in which case we need to raise a kind of ``failure
-exception''.  That means actually we have partial functions as
+exception''.  That means actually we have \emph{partial} functions as
-transitions---see our implementation later on.  A typical example of a
+transitions---see the implementation later on.  A typical example of a
 DFA is
 \begin{center}
 \begin{tikzpicture}[>=stealth',very thick,auto,
 every state/.style={minimum size=0pt,
 end up in an accepting state, then this string is accepted by the
 automaton. Otherwise it is not accepted. This also means that if along
 the way we hit the case where the transition function $\delta$ is not
 defined, we need to raise an error. In our implementation we will deal
 with this case elegantly by using Scala's \texttt{Try}. So a string
-$s$ is in the \emph{language accepted by the automaton} $A(Q, Q_0, F,
+$s$ is in the \emph{language accepted by the automaton} $A(\Sigma, Qs, Q_0, F,
 \delta)$ iff
 \[
 \hat{\delta}(Q_0, s) \in F
 \]
 \noindent I let you think about a definition that describes
-the set of strings accepted by an automaton.
+the set of all strings accepted by an automaton.
 \begin{figure}[p]
 \small
 \lstinputlisting[numbers=left,linebackgroundcolor=
 {\ifodd\value{lstnumber}\color{capri!3}\fi}]
 {../progs/dfa.scala}
 \caption{A Scala implementation of DFAs using partial functions.
 Notice some subtleties: \texttt{deltas} implements the delta-hat
 construction by lifting the transition (partial) function to
-lists of \texttt{C}haracters. Since \texttt{delta} is given
+lists of characters. Since \texttt{delta} is given
-as partial function, it can obviously go ``wrong'' in which
+as a partial function, it can obviously go ``wrong'' in which
 case the \texttt{Try} in \texttt{accepts} catches the error and
 returns \texttt{false}---that means the string is not accepted.
 The example \texttt{delta} implements the DFA example shown
 earlier in the handout.\label{dfa}}
 \end{figure}
 A simple Scala implementation for DFAs is given in
-Figure~\ref{dfa}. As you can see, there many features of the
+Figure~\ref{dfa}. As you can see, there are many features of the
 mathematical definition that are quite closely reflected in the
 code. In the DFA-class, there is a starting state, called
-\texttt{start}, with the polymorphic type \texttt{A}. (The reason for
+\texttt{start}, with the polymorphic type \texttt{A}.  There is a
-having a polymorphic types for the states and the input of DFAs will
+partial function \texttt{delta} for specifying the transitions---these
-become clearer later.)  There is a partial function \texttt{delta} for
+partial functions take a state (of polymorphic type \texttt{A}) and an
-specifying the transitions. I used partial functions for representing
+input (of polymorphic type \texttt{C}) and produce a new state (of
-the transitions in Scala; alternatives would have been \texttt{Maps}
+type \texttt{A}). For the moment it is OK to assume that \texttt{A} is
-or even \texttt{Lists}. One of the main advantages of using partial
+some arbitrary type for states and the input is just characters.  (The
-functions is that transitions can be quite nicely defined by a series
+reason for having polymorphic types for the states and the input of
-of \texttt{case} statements (see Lines 28 -- 38). If you need to
+DFAs will become clearer later on.)
-represent an automaton with a sink state (catch-all-state), you can
-write something like
+I used partial functions for representing the transitions in Scala;
+alternatives would have been \texttt{Maps} or even \texttt{Lists}. One
+of the main advantages of using partial functions is that transitions
+can be quite nicely defined by a series of \texttt{case} statements
+(see Lines 28 -- 38 for an example). If you need to represent an
+automaton with a sink state (catch-all-state), you can use Scala's
+pattern matching and write something like
 {\small\begin{lstlisting}[language=Scala,linebackgroundcolor=
 {\ifodd\value{lstnumber}\color{capri!3}\fi}]
 abstract class State
 ...
 case _ => Sink
 }
 \end{lstlisting}}
 \noindent The DFA-class also has an argument for specifying final
-states. In the implementation it is a function from states to booleans
+states. In the implementation it is not a set of states, as in the
-(returns true whenever a state is meant to be final; false
+mathematical definition, but is a function from states to booleans
-otherwise). While this boolean function is different from the sets of
+(this function is supposed to return true whenever a state is meant to
-states used in the mathematical definition, Scala allows me to use
+be final; false otherwise). While this boolean function is different
-sets for such functions (see Line 40 where I initialise the DFA).
+from the sets of states, Scala allows sets to be used for such functions
+(see Line 40 where the DFA is initialised). Again it will become clear
+later on why I use functions for final states, rather than sets.
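To make the points above concrete, here is a minimal, self-contained sketch of how such a DFA class might look, together with a small automaton that uses a sink state. The file \texttt{../progs/dfa.scala} is not reproduced here, so the exact shape of the class, the type alias \texttt{:=>} and the three-state example automaton are assumptions based on the description in the text, not the actual code from the figure.

```scala
import scala.util.Try

// Shorthand for partial functions (assumed alias).
type :=>[A, B] = PartialFunction[A, B]

// Assumed shape of the DFA class: a start state, a partial transition
// function and a predicate that marks the accepting states.
case class DFA[A, C](start: A, delta: (A, C) :=> A, fins: A => Boolean) {
  // delta-hat: lift the transition function to lists of inputs
  def deltas(q: A, s: List[C]): A = s match {
    case Nil     => q
    case c :: cs => deltas(delta((q, c)), cs)
  }
  // an undefined transition throws a MatchError; Try turns it into false
  def accepts(s: List[C]): Boolean =
    Try(fins(deltas(start, s))).getOrElse(false)
}

// Hypothetical automaton that accepts exactly the string "a".
sealed abstract class State
case object Q0 extends State
case object Q1 extends State
case object Sink extends State

val delta: (State, Char) :=> State = {
  case (Q0, 'a') => Q1
  case _         => Sink   // every other combination ends in the sink state
}

// a set of states is passed where a function State => Boolean is expected
val dfa = DFA[State, Char](Q0, delta, Set[State](Q1))
```

In this sketch \texttt{dfa.accepts("a".toList)} holds, while \texttt{"ab"} ends up in \texttt{Sink} and is rejected. Because of the catch-all case this particular \texttt{delta} is defined everywhere, so the \texttt{Try} never has to catch anything here.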
 I let you ponder whether this is a good implementation of DFAs. In
 doing so I hope you notice that the $Qs$ component (the finite set of
 states) is missing from the class definition. This means that the
 implementation allows you to do some fishy things you are not meant to
 With DFAs it is always clear, given a state and a character, what the
 next state is (potentially there is none). It will nevertheless be useful
 to relax this restriction. That means we allow several
 potential successor states. We even allow more than one starting
 state. The resulting construction is called a \emph{non-deterministic
-finite automaton} (NFA) given also as a four-tuple $A(Qs, Q_{0s}, F,
+finite automaton} (NFA) given also as a five-tuple $A(\Sigma, Qs, Q_{0s}, F,
 \rho)$ where
 \begin{itemize}
+\item $\Sigma$ is an alphabet,
 \item $Qs$ is a finite set of states
 \item $Q_{0s}$ is a set of start states ($Q_{0s} \subseteq Qs$)
 \item $F$ are some accepting states with $F \subseteq Qs$, and
 \item $\rho$ is a transition relation.
 \end{itemize}
 \noindent
-A typical example of an NFA is
+A typical example of a NFA is
 % A NFA for (ab* + b)*a
 \begin{center}
-\begin{tikzpicture}[scale=0.8,>=stealth',very thick,
+\begin{tikzpicture}[>=stealth',very thick, auto,
-every state/.style={minimum size=0pt,draw=blue!50,very thick,fill=blue!20},]
+every state/.style={minimum size=0pt,inner sep=3pt,
+draw=blue!50,very thick,fill=blue!20},scale=2]
 \node[state,initial]  (Q_0)  {$Q_0$};
 \node[state] (Q_1) [right=of Q_0] {$Q_1$};
 \node[state, accepting] (Q_2) [right=of Q_1] {$Q_2$};
 \path[->] (Q_0) edge [loop above] node  {$b$} ();
 \path[<-] (Q_0) edge node [below]  {$b$} (Q_1);
 \end{center}
 \noindent
 This NFA happens to have only one starting state, but in general there
 could be more.  Notice that in state $Q_0$ we might go to state $Q_1$
-\emph{or} to state $Q_2$ when receiving an $a$. Similarly in state $Q_1$
+\emph{or} to state $Q_2$ when receiving an $a$. Similarly in state
-and receiving an $a$, we can stay in state $Q_1$ or go to $Q_2$. This
+$Q_1$ and receiving an $a$, we can stay in state $Q_1$ or go to
-kind of choice is not allowed with DFAs. When it comes to deciding
+$Q_2$. This kind of choice is not allowed with DFAs. The downside is
-whether a string is accepted by an NFA we potentially need to explore
+that when it comes to deciding whether a string is accepted by a NFA
-all possibilities. I let you think which kind of strings this NFA
+we potentially have to explore all possibilities. I let you think
-accepts.
+which kind of strings the above NFA accepts.
-There are however a number of additional points you should note. Every
+There are a number of additional points you should note with
-DFA is a NFA, but not vice versa. The $\rho$ in NFAs is a transition
+NFAs. Every DFA is a NFA, but not vice versa. The $\rho$ in NFAs is a
-\emph{relation} (DFAs have a transition function). The difference
+transition \emph{relation} (DFAs have a transition function). The
-between a function and a relation is that a function has always a
+difference between a function and a relation is that a function has
-single output, while a relation gives, roughly speaking, several
+always a single output, while a relation gives, roughly speaking,
-outputs. Look again at the NFA above: if you are currently in the
+several outputs. Look again at the NFA above: if you are currently in
-state $Q_1$ and you read a character $b$, then you can transition to
+the state $Q_1$ and you read a character $b$, then you can transition
-either $Q_0$ \emph{or} $Q_2$. Which route, or output, you take is not
+to either $Q_0$ \emph{or} $Q_2$. Which route, or output, you take is
-determined.  This non-determinism can be represented by a relation.
+not determined.  This non-determinism can be represented by a
+relation.
-My implementation of NFAs is shown in Figure~\ref{nfa}.  Perhaps
-interestingly, I do not use relations for my implementation of NFAs in
+My implementation of NFAs in Scala is shown in Figure~\ref{nfa}.
-Scala, and I also do not use transition functions which return a set
+Perhaps interestingly, I do not actually use relations for my NFAs,
-of states (another popular choice for implementing NFAs).  For reasons
+and I also do not use transition functions that return sets of states
-that become clear in a moment, I use sets of partial functions---see
+(another popular choice for implementing NFAs).  For reasons that
-Line 7 in Figure~\ref{nfa}. \texttt{Starts} is now a set of states;
+become clear in a moment, I use sets of partial functions
-\texttt{fins} is again a function from states to booleans. The
+instead---see Line 7 in Figure~\ref{nfa}. DFAs have only one such
-\texttt{next} function calculates the set of next states reachable
+partial function; my NFAs have a set.  Another parameter,
-from a state \texttt{q} by a character \texttt{c}---we go through all
+\texttt{Starts}, is in NFAs a set of states; \texttt{fins} is again a
-partial functions in the \texttt{delta}-set, lift it (this transformes
+function from states to booleans. The \texttt{next} function
-the partial function into the corresponding \texttt{Option}-function
+calculates the set of next states reachable from a single state
-and then apply \texttt{q} and \texttt{c}. This gives us a set of
+\texttt{q} by a character \texttt{c}---this is calculated by going
-\texttt{Some}s and \texttt{None}s, where the \texttt{Some}s are
+through all the partial functions in the \texttt{delta}-set and applying each to
-filtered out by the \texttt{flatMap}. Teh function \texttt{nexts} just
+\texttt{q} and \texttt{c} (see Line 13). This gives us a set of
-lifts this to sets of states. \texttt{Deltas} and \texttt{accept} are
+\texttt{Some}s (in case the application succeeded) and possibly some
-similar to the DFA definitions.
+\texttt{None}s (in case the partial function is not defined or produces an
+error).  The \texttt{None}s are filtered out by the \texttt{flatMap},
+leaving the values inside the \texttt{Some}s. The function
+\texttt{nexts} just lifts this function to sets of
+states. \texttt{Deltas} and \texttt{accept} are similar to the DFA
+definitions.
 \begin{figure}[t]
 \small
 \lstinputlisting[numbers=left,linebackgroundcolor=
 {\ifodd\value{lstnumber}\color{capri!3}\fi}]
 {../progs/nfa.scala}
 \caption{A Scala implementation of NFAs using sets of partial functions.
-Notice some subtleties: \texttt{deltas} implements the delta-hat
+Notice some subtleties: Since \texttt{delta} is given
-construction by lifting the transition (partial) function to
+as a set of partial functions, each of them can obviously go ``wrong'' in which
-lists of \texttt{C}haracters. Since \texttt{delta} is given
+case the \texttt{Try} catches the error. The function \texttt{accepts} implements the
-as partial function, it can obviously go ``wrong'' in which
+acceptance of a string in a breadth-first fashion. This can be a costly
-case the \texttt{Try} in \texttt{accepts} catches the error and
+way of deciding whether a string is accepted in practical contexts.\label{nfa}}
-returns \texttt{false}---that means the string is not accepted.
-The example \texttt{delta} implements the DFA example shown
-earlier in the handout.\label{nfa}}
 \end{figure}
+The reason for using sets of partial functions for specifying the
+transitions in NFAs has to do with examples like this one: a
+popular benchmark regular expression is $(.)^*\cdot a\cdot
+(.)^{\{n\}}\cdot b\cdot c$. A NFA that accepts the same strings
+(for $n=3$) is as follows:
+\begin{center}
+\begin{tikzpicture}[>=stealth',very thick, auto, node distance=7mm,
+every state/.style={minimum size=0pt,inner sep=1pt,
+draw=blue!50,very thick,fill=blue!20},scale=0.5]
+\node[state,initial]  (Q_0)  {$Q_0$};
+\node[state] (Q_1) [right=of Q_0] {$Q_1$};
+\node[state] (Q_2) [right=of Q_1] {$Q_2$};
+\node[state] (Q_3) [right=of Q_2] {$Q_3$};
+\node[state] (Q_4) [right=of Q_3] {$Q_4$};
+\node[state] (Q_5) [right=of Q_4] {$Q_5$};
+\node[state,accepting] (Q_6) [right=of Q_5] {$Q_6$};
+\path[->] (Q_0) edge [loop above] node  {$.$} ();
+\path[->] (Q_0) edge node [above]  {$a$} (Q_1);
+\path[->] (Q_1) edge node [above]  {$.$} (Q_2);
+\path[->] (Q_2) edge node [above]  {$.$} (Q_3);
+\path[->] (Q_3) edge node [above]  {$.$} (Q_4);
+\path[->] (Q_4) edge node [above]  {$b$} (Q_5);
+\path[->] (Q_5) edge node [above]  {$c$} (Q_6);
+\end{tikzpicture}
+\end{center}
+\noindent
+The $.$ stands for accepting any single character: for example if we
+are in $Q_0$ and read an $a$ we can either stay in $Q_0$ (since any
+character will do for this) or advance to $Q_1$ (but only if it is an
+$a$). Why this is a good benchmark regular expression is irrelevant
+here. The point is that this NFA can be conveniently represented by
+the code:
+{\small\begin{lstlisting}[language=Scala,linebackgroundcolor=
+{\ifodd\value{lstnumber}\color{capri!3}\fi}]
+val delta = Set[(State, Char) :=> State](
+{ case (Q0, 'a') => Q1 },
+{ case (Q0, _)   => Q0 },
+{ case (Q1, _)   => Q2 },
+{ case (Q2, _)   => Q3 },
+{ case (Q3, _)   => Q4 },
+{ case (Q4, 'b') => Q5 },
+{ case (Q5, 'c') => Q6 }
+)
+NFA(Set[State](Q0), delta, Set[State](Q6))
+\end{lstlisting}}
+\noindent
+where the $.$-transitions translate into an
+underscore pattern. Recall that in $Q_0$ if we read an $a$ we
+can go to $Q_1$ (by the first partial function in the set) and also
+stay in $Q_0$ (by the second partial function). Representing such
+transitions in any other way is somewhat awkward; representing them
+as a set of partial functions makes them easy to implement.
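The following self-contained sketch puts the pieces together for the benchmark NFA with $n=3$. The file \texttt{../progs/nfa.scala} is not reproduced here, so the class below is an assumption that follows the description of \texttt{next}, \texttt{nexts}, \texttt{deltas} and \texttt{accepts} given earlier; the state objects \texttt{Q0} to \texttt{Q6} and the alias \texttt{:=>} are likewise assumed.

```scala
import scala.util.Try

// Shorthand for partial functions (assumed alias).
type :=>[A, B] = PartialFunction[A, B]

// Assumed shape of the NFA class, following the description in the text.
case class NFA[A, C](starts: Set[A], delta: Set[(A, C) :=> A], fins: A => Boolean) {
  // successors of one state: try every partial function, keep the successes
  def next(q: A, c: C): Set[A] =
    delta.flatMap(d => Try(d((q, c))).toOption)
  // successors of a set of states
  def nexts(qs: Set[A], c: C): Set[A] = qs.flatMap(next(_, c))
  // breadth-first: carry the whole set of active states through the input
  def deltas(qs: Set[A], s: List[C]): Set[A] = s match {
    case Nil     => qs
    case c :: cs => deltas(nexts(qs, c), cs)
  }
  def accepts(s: List[C]): Boolean = deltas(starts, s).exists(fins)
}

sealed abstract class State
case object Q0 extends State
case object Q1 extends State
case object Q2 extends State
case object Q3 extends State
case object Q4 extends State
case object Q5 extends State
case object Q6 extends State

val delta = Set[(State, Char) :=> State](
  { case (Q0, 'a') => Q1 },
  { case (Q0, _)   => Q0 },
  { case (Q1, _)   => Q2 },
  { case (Q2, _)   => Q3 },
  { case (Q3, _)   => Q4 },
  { case (Q4, 'b') => Q5 },
  { case (Q5, 'c') => Q6 }
)

val nfa = NFA[State, Char](Set[State](Q0), delta, Set[State](Q6))
```

In this sketch \texttt{nfa.next(Q0, 'a')} yields both \texttt{Q0} and \texttt{Q1}, which is exactly the non-deterministic choice described above. After reading \texttt{"aaaa"} five states are active at once, which hints at how the set of active states can grow with the input.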
+Look very carefully again at the \texttt{accepts} and \texttt{deltas}
+functions in NFAs and remember that when accepting a string by an NFA
+we might have to explore all possible transitions (which state to go
+to is not unique anymore). The shown implementation achieves this
+exploration in a \emph{breadth-first search}. This is fine for very
+small NFAs, but can lead to problems when the NFAs are bigger. Take
+for example the regular expression $(.)^*\cdot a\cdot (.)^{\{n\}}\cdot
+b\cdot c$ from above. If $n$ is large, say 100 or 1000, then the
+corresponding NFA will have 104, respectively 1004, nodes. The problem
+is that with certain strings this can lead to 1000 ``active'' nodes
+which we need to analyse when determining the next states. This can be
+a real memory strain in practical applications. As a result, some
+regular expression matching engines resort to a \emph{depth-first
+search} with \emph{backtracking} in unsuccessful cases. In our
+implementation we could implement a depth-first version of
+\texttt{accepts} using Scala's \texttt{exists} as follows:
+{\small\begin{lstlisting}[language=Scala,linebackgroundcolor=
+{\ifodd\value{lstnumber}\color{capri!3}\fi}]
+def search(q: A, s: List[C]) : Boolean = s match {
+case Nil => fins(q)
+case c::cs =>
+delta.exists((d) => Try(search(d(q, c), cs)) getOrElse false)
+}
+def accepts(s: List[C]) : Boolean =
+starts.exists(search(_, s))
+\end{lstlisting}}
+\noindent
+This depth-first way of exploration seems to work efficiently in many
+examples and is much less of a strain on memory. The problem is that the
+backtracking can get catastrophic in some examples---remember the
+catastrophic backtracking from earlier lectures. This depth-first
+search with backtracking is the reason for the abysmal performance of
+regular expression matching in Java, Ruby and Python.
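For experimenting with the depth-first idea outside the NFA class, a standalone sketch could look as follows. It is a variation on the \texttt{search} function above and not the code from the handout: the two-parameter-list signature, the \texttt{Int} states and the small example NFA (strings ending in $a$) are assumptions made for illustration. In this variant the \texttt{Try} wraps only the single transition rather than the whole recursive call.

```scala
import scala.util.Try

// Shorthand for partial functions (assumed alias).
type :=>[A, B] = PartialFunction[A, B]

// Depth-first acceptance test: follow one transition at a time and
// backtrack (via exists) when a branch does not lead to acceptance.
def search[A, C](delta: Set[(A, C) :=> A], fins: A => Boolean)(
    q: A, s: List[C]): Boolean = s match {
  case Nil     => fins(q)
  case c :: cs =>
    delta.exists(d =>
      // an undefined transition gives None, which counts as a dead end
      Try(d((q, c))).toOption.exists(search(delta, fins)(_, cs)))
}

// Hypothetical two-state NFA: stay in 0 on any character, or move to
// the accepting state 1 on an 'a'.
val delta = Set[(Int, Char) :=> Int](
  { case (0, _)   => 0 },
  { case (0, 'a') => 1 }
)
val fins: Int => Boolean = _ == 1

def accepts(s: String): Boolean = search(delta, fins)(0, s.toList)
```

Here \texttt{accepts("bba")} succeeds, while for \texttt{"ab"} every branch is explored and abandoned before the answer \texttt{false} is returned. That exhaustive retrying is the backtracking behaviour discussed above.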
 %This means if
 %we need to decide whether a string is accepted by a NFA, we might have
 %to explore all possibilities. Also there is the special silent
 %transition in NFAs. As mentioned already this transition means you do
 for the purposes of this module, we omit this.
 \subsubsection*{Automata Minimization}
 As seen in the subset construction, the translation
-of an NFA to a DFA can result in a rather ``inefficient''
+of a NFA to a DFA can result in a rather ``inefficient''
 DFA, meaning that there are states which are not needed. A
 DFA can be \emph{minimised} by the following algorithm:
 \begin{enumerate}
 \item Take all pairs $(q, p)$ with $q \not= p$
 with the size of the regular expression. The problem with NFAs
 is that deciding whether a string is accepted
 or not is computationally not cheap. Remember with NFAs we
 have potentially many next states even for the same input and
 also have the silent $\epsilon$-transitions. If we want to
-find a path from the starting state of an NFA to an accepting
+find a path from the starting state of a NFA to an accepting
 state, we need to consider all possibilities. In Ruby and
 Python this is done by a depth-first search, which in turn
 means that if a ``wrong'' choice is made, the algorithm has to
 backtrack and thus explore all potential candidates. This is
 exactly the reason why Ruby and Python are so slow for evil
 regular expressions because they are useful for recognising
 comments. But in principle we did not need to. The argument
 for this is as follows: take a regular expression, translate
 it into a NFA and then a DFA that both recognise the same
 language. Once you have the DFA it is very easy to construct
-the automaton for the language not recognised by an DFA. If
+the automaton for the language not recognised by a DFA. If
 the DFA is completed (this is important!), then you just need
 to exchange the accepting and non-accepting states. You can
 then translate this DFA back into a regular expression and
 that will be the regular expression that can match all strings
 the original regular expression could \emph{not} match.
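The step of exchanging accepting and non-accepting states can be sketched in a few lines of Scala. The class below is an assumption made for this illustration: it uses a total transition function (so the DFA is completed by construction) and represents the accepting states by a predicate, in which case the exchange is just a negation. The example automaton over the alphabet $\{a, b\}$ is hypothetical.

```scala
// A DFA with a total transition function, that is a completed DFA
// (assumed representation for this sketch).
case class DFA[A, C](start: A, delta: (A, C) => A, fins: A => Boolean) {
  def accepts(s: List[C]): Boolean =
    fins(s.foldLeft(start)((q, c) => delta(q, c)))
  // exchange accepting and non-accepting states
  def complement: DFA[A, C] = DFA(start, delta, (q: A) => !fins(q))
}

// Hypothetical example: state 1 is reached exactly after reading an 'a',
// so this DFA accepts the strings that end in 'a'.
val endsInA = DFA[Int, Char](0, (q, c) => if (c == 'a') 1 else 0, _ == 1)
val notEndsInA = endsInA.complement
```

Here \texttt{notEndsInA} accepts \texttt{"ab"} and the empty string, but not \texttt{"ba"}. If the transition function were only partial, a string that gets stuck would be rejected both before and after the exchange, which is why the completion matters.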

changeset 483	faba5360372c
parent 482	74149519e436
child 484	8182eb3278e0