afl-material: comparison handouts/ho03.tex

equal deleted inserted replaced

-:6f508bcdaa30
+:e61ffb28994d
 \documentclass{article}
 \usepackage{../style}
 \usepackage{../langs}
 \usepackage{../graphics}
 %We can even allow ``silent
 %transitions'', also called epsilon-transitions. They allow us
 %to go from one state to the next without having a character
 %consumed. We label such silent transition with the letter
 The central definition is:\medskip
 \noindent
 A \emph{deterministic finite automaton} (DFA), say $A$, is
-given by a five-tuple written $A(\Sigma, Qs, Q_0, F, \delta)$ where
+given by a five-tuple written ${\cal A}(\varSigma, Qs, Q_0, F, \delta)$ where
 \begin{itemize}
-\item $\Sigma$ is an alphabet,
+\item $\varSigma$ is an alphabet,
 \item $Qs$ is a finite set of states,
 \item $Q_0 \in Qs$ is the start state,
 \item $F \subseteq Qs$ are the accepting states, and
 \item $\delta$ is the transition function.
 \end{itemize}
 from one state to the next state with respect to a character. We have
 the assumption that these transition functions do not need to be
 defined everywhere: so it can be the case that given a character there
 is no next state, in which case we need to raise a kind of ``failure
 exception''.  That means actually we have \emph{partial} functions as
-transitions---see the implementation later on.  A typical example of a
+transitions---see the Scala implementation of DFAs later on.  A
-DFA is
+typical example of a DFA is
 \begin{center}
 \begin{tikzpicture}[>=stealth',very thick,auto,
 every state/.style={minimum size=0pt,
 inner sep=2pt,draw=blue!50,very thick,
 an automaton. For this we lift the transition function
 $\delta$ from characters to strings as follows:
 \[
 \begin{array}{lcl}
-\hat{\delta}(q, [])        & \dn & q\\
+\widehat{\delta}(q, [])        & \dn & q\\
-\hat{\delta}(q, c\!::\!s) & \dn & \hat{\delta}(\delta(q, c), s)\\
+\widehat{\delta}(q, c\!::\!s) & \dn & \widehat{\delta}(\delta(q, c), s)\\
 \end{array}
 \]
 \noindent This lifted transition function is often called
 ``delta-hat''. Given a string, we start in the starting state and take
-the first character of the string, follow to the next sate, then take
+the first character of the string, follow to the next state, then take
 the second character and so on. Once the string is exhausted and we
 end up in an accepting state, then this string is accepted by the
 automaton. Otherwise it is not accepted. This also means that if along
 the way we hit the case where the transition function $\delta$ is not
 defined, we need to raise an error. In our implementation we will deal
 with this case elegantly by using Scala's \texttt{Try}. So a string
-$s$ is in the \emph{language accepted by the automaton} $A(\Sigma, Q, Q_0, F,
+$s$ is in the \emph{language accepted by the automaton} ${\cal
-\delta)$ iff
+A}(\varSigma, Q, Q_0, F, \delta)$ iff
 \[
-\hat{\delta}(Q_0, s) \in F
+\widehat{\delta}(Q_0, s) \in F
 \]
 \noindent I let you think about a definition that describes
 the set of all strings accepted by an automaton.
 returns \texttt{false}---that means the string is not accepted.
 The example \texttt{delta} implements the DFA example shown
 earlier in the handout.\label{dfa}}
 \end{figure}
-A simple Scala implementation for DFAs is given in
+My take of a simple Scala implementation for DFAs is given in
 Figure~\ref{dfa}. As you can see, there are many features of the
 mathematical definition that are quite closely reflected in the
 code. In the DFA-class, there is a starting state, called
 \texttt{start}, with the polymorphic type \texttt{A}.  There is a
 partial function \texttt{delta} for specifying the transitions---these
 partial functions take a state (of polymorphic type \texttt{A}) and an
 input (of polymorphic type \texttt{C}) and produce a new state (of
 type \texttt{A}). For the moment it is OK to assume that \texttt{A} is
 some arbitrary type for states and the input is just characters.  (The
-reason for having a polymorphic types for the states and the input of
+reason for having polymorphic types for the states and the input of
 DFAs will become clearer later on.)
-I used partial functions for representing the transitions in Scala;
+The most important point in this implemnetation is that I use Scala's
-alternatives would have been \texttt{Maps} or even \texttt{Lists}. One
+partial functions for representing the transitions; alternatives would
-of the main advantages of using partial functions is that transitions
+have been \texttt{Maps} or even \texttt{Lists}. One of the main
-can be quite nicely defined by a series of \texttt{case} statements
+advantages of using partial functions is that transitions can be quite
-(see Lines 28 -- 38 for an example). If you need to represent an
+nicely defined by a series of \texttt{case} statements (see Lines 28
-automaton with a sink state (catch-all-state), you can use Scala's
+-- 38 for an example). If you need to represent an automaton with a
-pattern matching and write something like
+sink state (catch-all-state), you can use Scala's pattern matching and
+write something like
 {\small\begin{lstlisting}[language=Scala,linebackgroundcolor=
 {\ifodd\value{lstnumber}\color{capri!3}\fi}]
 abstract class State
 ...
 case (S1, 'a') => S2
 case _ => Sink
 }
 \end{lstlisting}}
-\noindent The DFA-class has also an argument for specifying final
+\noindent I let you think what this DFA looks like in the graphical
-states. In the implementation it not a set of states, as in the
+notation.
-matemathical definition, but is a function from states to booleans
-(this function is supposed to return true whenever a state is meant to
+The DFA-class has also an argument for specifying final states. In the
-be final; false otherwise). While this boolean function is different
+implementation it not a set of states, as in the matemathical
-from the sets of states, Scala allows to use sets for such functions
+definition, but a function from states to booleans (this function is
-(see Line 40 where the DFA is initialised). Again it will become clear
+supposed to return true whenever a state is final; false
-later on why I use functions for final states, rather than sets.
+otherwise). While this boolean function is different from the sets of
+states, Scala allows to use sets for such functions (see Line 40 where
+the DFA is initialised). Again it will become clear later on why I use
+functions for final states, rather than sets.
 I let you ponder whether this is a good implementation of DFAs. In
-doing so I hope you notice that the $Qs$ component (the set of finite
+doing so I hope you notice that the $\varSigma$ and $Qs$ components (the
-states) is missing from the class definition. This means that the
+alphabet and the set of finite states, respectively) are missing from
-implementation allows you to do some fishy things you are not meant to
+the class definition. This means that the implementation allows you to
-do with DFAs. Which fishy things could that be?
+do some fishy things you are not meant to do with DFAs. Which fishy
+things could that be?
 \subsection*{Non-Deterministic Finite Automata}
-While with DFAs it will always be clear that given a state and a
+While with DFAs it is always be clear that given a state and a
 character what the next state is (potentially none), it will be useful
-to relax this restriction. That means we allow to have several
+to relax this restriction. That means we allow states to have several
 potential successor states. We even allow more than one starting
-state. The resulting construction is called a \emph{non-deterministic
+state. The resulting construction is called a \emph{Non-Deterministic
-finite automaton} (NFA) given also as a five-tuple $A(\Sigma, Qs, Q_{0s}, F,
+Finite Automaton} (NFA) given also as a five-tuple ${\cal A}(\varSigma,
-\rho)$ where
+Qs, Q_{0s}, F, \rho)$ where
 \begin{itemize}
-\item $\Sigma$ is an alphabet,
+\item $\varSigma$ is an alphabet,
 \item $Qs$ is a finite set of states
 \item $Q_{0s}$ is a set of start states ($Q_{0s} \subseteq Qs$)
 \item $F$ are some accepting states with $F \subseteq Qs$, and
 \item $\rho$ is a transition relation.
 \end{itemize}
 \noindent
 This NFA happens to have only one starting state, but in general there
 could be more.  Notice that in state $Q_0$ we might go to state $Q_1$
 \emph{or} to state $Q_2$ when receiving an $a$. Similarly in state
-$Q_1$ and receiving an $a$, we can stay in state $Q_1$ or go to
+$Q_1$ and receiving an $a$, we can stay in state $Q_1$ \emph{or} go to
-$Q_2$. This kind of choice is not allowed with DFAs. The downside is
+$Q_2$. This kind of choice is not allowed with DFAs. The downside of
-that when it comes to deciding whether a string is accepted by a NFA
+this choice is that when it comes to deciding whether a string is
-we potentially have to explore all possibilities. I let you think
+accepted by a NFA we potentially have to explore all possibilities. I
-which kind of strings the above NFA accepts.
+let you think which kind of strings the above NFA accepts.
 There are a number of additional points you should note with
 NFAs. Every DFA is a NFA, but not vice versa. The $\rho$ in NFAs is a
 transition \emph{relation} (DFAs have a transition function). The
 and I also do not use transition functions that return sets of states
 (another popular choice for implementing NFAs).  For reasons that
 become clear in a moment, I use sets of partial functions
 instead---see Line 7 in Figure~\ref{nfa}. DFAs have only one such
 partial function; my NFAs have a set.  Another parameter,
-\texttt{Starts}, is in NFAs a set of states; \texttt{fins} is again a
+\texttt{starts}, is in NFAs a set of states; \texttt{fins} is again a
 function from states to booleans. The \texttt{next} function
 calculates the set of next states reachable from a single state
-\texttt{q} by a character \texttt{c}---this is calculated by going
+\texttt{q} by a character~\texttt{c}---this is calculated by going
 through all the partial functions in the \texttt{delta}-set and apply
-\texttt{q} and \texttt{c} (see Line 13). This gives us a set of
+\texttt{q} and \texttt{c} (see Line 13). This gives a set of
 \texttt{Some}s (in case the application succeeded) and possibly some
 \texttt{None}s (in case the partial function is not defined or produces an
 error).  The \texttt{None}s are filtered out by the \texttt{flatMap},
 leaving the values inside the \texttt{Some}s. The function
 \texttt{nexts} just lifts this function to sets of
 states. \texttt{Deltas} and \texttt{accept} are similar to the DFA
 definitions.
-\begin{figure}[t]
+\begin{figure}[p]
 \small
 \lstinputlisting[numbers=left,linebackgroundcolor=
 {\ifodd\value{lstnumber}\color{capri!3}\fi}]
 {../progs/nfa.scala}
 \caption{A Scala implementation of NFAs using sets of partial functions.
 acceptance of a string in a breath-first fashion. This can be costly
 way of deciding whether a string is accepted in practical contexts.\label{nfa}}
 \end{figure}
 The reason for using sets of partial functions for specifying the
-transitions in NFAs has to do with examples like this one: a
+transitions in NFAs has to do with pattern matching. Consider the
-popular benchmark regular expression is $(.)^*\cdot a\cdot
+following example: a popular benchmark regular expression is
-(.)^{\{n\}}\cdot b\cdot c$. A NFA that accepts the same strings
+$(.)^*\cdot a\cdot (.)^{\{n\}}\cdot b\cdot c$. The important point to
-(for $n=3$) is as follows:
+note is that it uses $.$ in order to represent the regular expression
+that accepts any character. A NFA that accepts the same strings as
+this regular expression (for $n=3$) is as follows:
 \begin{center}
 \begin{tikzpicture}[>=stealth',very thick, auto, node distance=7mm,
 every state/.style={minimum size=0pt,inner sep=1pt,
 draw=blue!50,very thick,fill=blue!20},scale=0.5]
 \path[->] (Q_5) edge node [above]  {$c$} (Q_6);
 \end{tikzpicture}
 \end{center}
 \noindent
-The $.$ stands for accepting any single character: for example if we
+Also here the $.$ stands for accepting any single character: for example if we
 are in $Q_0$ and read an $a$ we can either stay in $Q_0$ (since any
 character will do for this) or advance to $Q_1$ (but only if it is an
 $a$). Why this is a good benchmark regular expression is irrelevant
 here. The point is that this NFA can be conveniently represented by
 the code:
 \noindent
 where the $.$-transitions translate into a
 underscore-pattern-matching. Recall that in $Q_0$ if we read an $a$ we
 can go to $Q_1$ (by the first partial function in the set) and also
 stay in $Q_0$ (by the second partial function). Representing such
-transitions in any other way is somehow awkward; the set of partial
+transitions in any other way in Scala seems to be somehow awkward; the
-function representation makes them easy to implement.
+set of partial function representation makes them easy to implement.
 Look very careful again at the \texttt{accepts} and \texttt{deltas}
 functions in NFAs and remember that when accepting a string by an NFA
-we might have to explore all possible transitions (which state to go
+we might have to explore all possible transitions (recall which state
-to is not unique anymore). The shown implementation achieves this
+to go to is not unique anymore with NFAs). The implementation achieves
-exploration in a \emph{breath-first search}. This is fine for very
+this exploration in a \emph{breadth-first search} manner. This is fine
-small NFAs, but can lead to problems when the NFAs are bigger. Take
+for very small NFAs, but can lead to problems when the NFAs are
-for example the regular expression $(.)^*\cdot a\cdot (.)^{\{n\}}\cdot
+bigger. Take for example the regular expression $(.)^*\cdot a\cdot
-b\cdot c$ from above. If $n$ is large, say 100 or 1000, then the
+(.)^{\{n\}}\cdot b\cdot c$ from above. If $n$ is large, say 100 or
-corresponding NFA will have 104, respectively 1004, nodes. The problem
+1000, then the corresponding NFA will have 104, respectively 1004,
-is that with certain strings this can lead to 1000 ``active'' nodes
+nodes. The problem is that with certain strings this can lead to 1000
-which we need to analyse when determining the next states. This can be
+``active'' nodes in the breadth-first search, all of which we need to
-a real memory strain in practical applications. As a result, some
+analyse when determining the next states. This can be a real memory
-regular expression matching engines resort to a \emph{depth-first
+strain in practical applications. As result, some regular expression
-search} with \emph{backtracking} in unsuccessful cases. In our
+matching engines resort to a \emph{depth-first search} with
-implementation we could implement a depth-first version of
+\emph{backtracking} in unsuccessful cases. In our implementation we
-\texttt{accepts} using Scala's \texttt{exists} as follows:
+can implement a depth-first version of \texttt{accepts} using Scala's
+\texttt{exists} as follows:
 {\small\begin{lstlisting}[language=Scala,linebackgroundcolor=
 {\ifodd\value{lstnumber}\color{capri!3}\fi}]
 def search(q: A, s: List[C]) : Boolean = s match {
 case Nil => fins(q)
 case c::cs =>
-delta.exists((d) => Try(search(d(q, c), cs)) getOrElse false)
+delta.exists(d => Try(search(d(q, c), cs)) getOrElse false)
 }
 def accepts(s: List[C]) : Boolean =
 starts.exists(search(_, s))
 \end{lstlisting}}
 \noindent
 This depth-first way of exploration seems to work efficiently in many
 examples and is much less of strain on memory. The problem is that the
-backtracking can get catastrophic in some examples---remember the
+backtracking can get ``catastrophic'' in some examples---remember the
 catastrophic backtracking from earlier lectures. This depth-first
 search with backtracking is the reason for the abysmal performance of
-regular expression macthing in Java, Ruby and Python.
+some regular expression macthings in Java, Ruby and Python. I like to
+show you this next.
 %This means if
 %we need to decide whether a string is accepted by a NFA, we might have
 %to explore all possibilities. Also there is the special silent
 %transition in NFAs. As mentioned already this transition means you do
 %%$Q_2$. This silent transition is also often called
 %\emph{$\epsilon$-transition}.
 \subsubsection*{Thompson Construction}
+In order to get an idea what calculations are done in Java \& friends,
+we need a method for translating regular expressions into
+automata. The simplest and most well-known method is called
+\emph{Thompson Construction}, after the Turing Award winner Ken
+Thompson who implemented this method in early versions of grep????
 The reason for introducing NFAs is that there is a relatively
 simple (recursive) translation of regular expressions into
 NFAs. Consider the simple regular expressions $\ZERO$,
 $\ONE$ and $c$. They can be translated as follows:

changeset 484	e61ffb28994d
parent 483	6f508bcdaa30
child 485	bd6d999ab7b6