--- a/handouts/ho03.tex Tue Apr 11 06:22:46 2017 +0800
+++ b/handouts/ho03.tex Sat Apr 15 22:03:59 2017 +0800
@@ -3,7 +3,6 @@
\usepackage{../langs}
\usepackage{../graphics}
-
%We can even allow ``silent
%transitions'', also called epsilon-transitions. They allow us
%to go from one state to the next without having a character
@@ -33,10 +32,10 @@
\noindent
A \emph{deterministic finite automaton} (DFA), say $A$, is
-given by a five-tuple written $A(\Sigma, Qs, Q_0, F, \delta)$ where
+given by a five-tuple written ${\cal A}(\varSigma, Qs, Q_0, F, \delta)$ where
\begin{itemize}
-\item $\Sigma$ is an alphabet,
+\item $\varSigma$ is an alphabet,
\item $Qs$ is a finite set of states,
\item $Q_0 \in Qs$ is the start state,
\item $F \subseteq Qs$ are the accepting states, and
@@ -49,8 +48,8 @@
defined everywhere: so it can be the case that given a character there
is no next state, in which case we need to raise a kind of ``failure
exception''. That means actually we have \emph{partial} functions as
-transitions---see the implementation later on. A typical example of a
-DFA is
+transitions---see the Scala implementation of DFAs later on. A
+typical example of a DFA is
\begin{center}
\begin{tikzpicture}[>=stealth',very thick,auto,
@@ -102,25 +101,25 @@
\[
\begin{array}{lcl}
-\hat{\delta}(q, []) & \dn & q\\
-\hat{\delta}(q, c\!::\!s) & \dn & \hat{\delta}(\delta(q, c), s)\\
+\widehat{\delta}(q, []) & \dn & q\\
+\widehat{\delta}(q, c\!::\!s) & \dn & \widehat{\delta}(\delta(q, c), s)\\
\end{array}
\]
\noindent This lifted transition function is often called
``delta-hat''. Given a string, we start in the starting state and take
-the first character of the string, follow to the next sate, then take
+the first character of the string, follow to the next state, then take
the second character and so on. Once the string is exhausted and we
end up in an accepting state, then this string is accepted by the
automaton. Otherwise it is not accepted. This also means that if along
the way we hit the case where the transition function $\delta$ is not
defined, we need to raise an error. In our implementation we will deal
with this case elegantly by using Scala's \texttt{Try}. So a string
-$s$ is in the \emph{language accepted by the automaton} $A(\Sigma, Q, Q_0, F,
-\delta)$ iff
+$s$ is in the \emph{language accepted by the automaton} ${\cal
+ A}(\varSigma, Q, Q_0, F, \delta)$ iff
\[
-\hat{\delta}(Q_0, s) \in F
+\widehat{\delta}(Q_0, s) \in F
\]
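+
+\noindent For example, a minimal Scala sketch of this lifted transition
+function and the acceptance test could look as follows (the names and
+types are my own and are not necessarily the ones used in
+Figure~\ref{dfa} later on):
+
+{\small\begin{lstlisting}[language=Scala,linebackgroundcolor=
+{\ifodd\value{lstnumber}\color{capri!3}\fi}]
+import scala.util.Try
+
+// "delta-hat": lift the single-step transition function to strings by
+// folding over the characters; an undefined transition simply throws
+def deltas[A, C](delta: PartialFunction[(A, C), A], q: A, s: List[C]) : A =
+  s.foldLeft(q)((q, c) => delta((q, c)))
+
+// a string is accepted if delta-hat ends up in a final state; Try
+// turns a failed transition into "not accepted"
+def accepts[A, C](delta: PartialFunction[(A, C), A], fins: A => Boolean,
+                  start: A, s: List[C]) : Boolean =
+  Try(fins(deltas(delta, start, s))) getOrElse false
+\end{lstlisting}}
+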
\noindent I let you think about a definition that describes
@@ -142,7 +141,7 @@
earlier in the handout.\label{dfa}}
\end{figure}
-A simple Scala implementation for DFAs is given in
+My take on a simple Scala implementation for DFAs is given in
Figure~\ref{dfa}. As you can see, there are many features of the
mathematical definition that are quite closely reflected in the
code. In the DFA-class, there is a starting state, called
@@ -152,16 +151,17 @@
input (of polymorphic type \texttt{C}) and produce a new state (of
type \texttt{A}). For the moment it is OK to assume that \texttt{A} is
some arbitrary type for states and the input is just characters. (The
-reason for having a polymorphic types for the states and the input of
+reason for having polymorphic types for the states and the input of
DFAs will become clearer later on.)
-I used partial functions for representing the transitions in Scala;
-alternatives would have been \texttt{Maps} or even \texttt{Lists}. One
-of the main advantages of using partial functions is that transitions
-can be quite nicely defined by a series of \texttt{case} statements
-(see Lines 28 -- 38 for an example). If you need to represent an
-automaton with a sink state (catch-all-state), you can use Scala's
-pattern matching and write something like
+The most important point in this implementation is that I use Scala's
+partial functions for representing the transitions; alternatives would
+have been \texttt{Maps} or even \texttt{Lists}. One of the main
+advantages of using partial functions is that transitions can be quite
+nicely defined by a series of \texttt{case} statements (see Lines 28
+-- 38 for an example). If you need to represent an automaton with a
+sink state (catch-all-state), you can use Scala's pattern matching and
+write something like
{\small\begin{lstlisting}[language=Scala,linebackgroundcolor=
{\ifodd\value{lstnumber}\color{capri!3}\fi}]
@@ -176,35 +176,39 @@
}
\end{lstlisting}}
-\noindent The DFA-class has also an argument for specifying final
-states. In the implementation it not a set of states, as in the
-matemathical definition, but is a function from states to booleans
-(this function is supposed to return true whenever a state is meant to
-be final; false otherwise). While this boolean function is different
-from the sets of states, Scala allows to use sets for such functions
-(see Line 40 where the DFA is initialised). Again it will become clear
-later on why I use functions for final states, rather than sets.
+\noindent I let you think what this DFA looks like in the graphical
+notation.
+
+The DFA-class also has an argument for specifying final states. In the
+implementation it is not a set of states, as in the mathematical
+definition, but a function from states to booleans (this function is
+supposed to return true whenever a state is final; false
+otherwise). While this boolean function is different from a set of
+states, Scala allows us to use sets for such functions (see Line 40 where
+the DFA is initialised). Again it will become clear later on why I use
+functions for final states, rather than sets.
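+To illustrate this point about sets, here is a tiny example (my own,
+not part of the handout's code) showing that a Scala \texttt{Set} is
+itself a function from elements to booleans:
+
+{\small\begin{lstlisting}[language=Scala,linebackgroundcolor=
+{\ifodd\value{lstnumber}\color{capri!3}\fi}]
+// a Set[Int] is also a function of type Int => Boolean, so it can be
+// passed directly wherever final states are expected
+val fins : Int => Boolean = Set(2, 4)   // "final" states 2 and 4
+fins(4)   // evaluates to true
+fins(0)   // evaluates to false
+\end{lstlisting}}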
I let you ponder whether this is a good implementation of DFAs. In
-doing so I hope you notice that the $Qs$ component (the set of finite
-states) is missing from the class definition. This means that the
-implementation allows you to do some fishy things you are not meant to
-do with DFAs. Which fishy things could that be?
+doing so I hope you notice that the $\varSigma$ and $Qs$ components (the
+alphabet and the finite set of states, respectively) are missing from
+the class definition. This means that the implementation allows you to
+do some fishy things you are not meant to do with DFAs. Which fishy
+things could that be?
\subsection*{Non-Deterministic Finite Automata}
-While with DFAs it will always be clear that given a state and a
+While with DFAs it is always clear that given a state and a
character what the next state is (potentially none), it will be useful
-to relax this restriction. That means we allow to have several
+to relax this restriction. That means we allow states to have several
potential successor states. We even allow more than one starting
-state. The resulting construction is called a \emph{non-deterministic
- finite automaton} (NFA) given also as a five-tuple $A(\Sigma, Qs, Q_{0s}, F,
-\rho)$ where
+state. The resulting construction is called a \emph{Non-Deterministic
+  Finite Automaton} (NFA), which is also given as a five-tuple
+${\cal A}(\varSigma, Qs, Q_{0s}, F, \rho)$ where
\begin{itemize}
-\item $\Sigma$ is an alphabet,
+\item $\varSigma$ is an alphabet,
\item $Qs$ is a finite set of states,
\item $Q_{0s}$ is a set of start states ($Q_{0s} \subseteq Qs$),
\item $F$ are some accepting states with $F \subseteq Qs$, and
@@ -235,11 +239,11 @@
This NFA happens to have only one starting state, but in general there
could be more. Notice that in state $Q_0$ we might go to state $Q_1$
\emph{or} to state $Q_2$ when receiving an $a$. Similarly in state
-$Q_1$ and receiving an $a$, we can stay in state $Q_1$ or go to
-$Q_2$. This kind of choice is not allowed with DFAs. The downside is
-that when it comes to deciding whether a string is accepted by a NFA
-we potentially have to explore all possibilities. I let you think
-which kind of strings the above NFA accepts.
+$Q_1$ and receiving an $a$, we can stay in state $Q_1$ \emph{or} go to
+$Q_2$. This kind of choice is not allowed with DFAs. The downside of
+this choice is that when it comes to deciding whether a string is
+accepted by an NFA we potentially have to explore all possibilities. I
+let you think which kind of strings the above NFA accepts.
There are a number of additional points you should note with
@@ -260,12 +264,12 @@
become clear in a moment, I use sets of partial functions
instead---see Line 7 in Figure~\ref{nfa}. DFAs have only one such
partial function; my NFAs have a set. Another parameter,
-\texttt{Starts}, is in NFAs a set of states; \texttt{fins} is again a
+\texttt{starts}, is in NFAs a set of states; \texttt{fins} is again a
function from states to booleans. The \texttt{next} function
calculates the set of next states reachable from a single state
-\texttt{q} by a character \texttt{c}---this is calculated by going
+\texttt{q} by a character~\texttt{c}---this is calculated by going
through all the partial functions in the \texttt{delta}-set and applying
-\texttt{q} and \texttt{c} (see Line 13). This gives us a set of
+\texttt{q} and \texttt{c} (see Line 13). This gives a set of
\texttt{Some}s (in case the application succeeded) and possibly some
\texttt{None}s (in case the partial function is not defined or produces an
error). The \texttt{None}s are filtered out by the \texttt{flatMap},
@@ -274,7 +278,7 @@
states. The functions \texttt{deltas} and \texttt{accepts} are similar
to the DFA definitions.
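+
+To make this description concrete, a rough sketch of such a
+\texttt{next} function (not necessarily identical to the code in
+Figure~\ref{nfa}) could look as follows:
+
+{\small\begin{lstlisting}[language=Scala,linebackgroundcolor=
+{\ifodd\value{lstnumber}\color{capri!3}\fi}]
+import scala.util.Try
+
+// apply every partial function in the delta-set to (q, c);
+// Try(...).toOption yields None for undefined or failing transitions,
+// which the flatMap then silently drops
+def next[A, C](delta: Set[PartialFunction[(A, C), A]], q: A, c: C) : Set[A] =
+  delta.flatMap(d => Try(d((q, c))).toOption)
+\end{lstlisting}}
+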
-\begin{figure}[t]
+\begin{figure}[p]
\small
\lstinputlisting[numbers=left,linebackgroundcolor=
{\ifodd\value{lstnumber}\color{capri!3}\fi}]
@@ -288,10 +292,12 @@
\end{figure}
The reason for using sets of partial functions for specifying the
-transitions in NFAs has to do with examples like this one: a
-popular benchmark regular expression is $(.)^*\cdot a\cdot
-(.)^{\{n\}}\cdot b\cdot c$. A NFA that accepts the same strings
-(for $n=3$) is as follows:
+transitions in NFAs has to do with pattern matching. Consider the
+following example: a popular benchmark regular expression is
+$(.)^*\cdot a\cdot (.)^{\{n\}}\cdot b\cdot c$. The important point to
+note is that it uses $.$ in order to represent the regular expression
+that accepts any character. An NFA that accepts the same strings as
+this regular expression (for $n=3$) is as follows:
\begin{center}
\begin{tikzpicture}[>=stealth',very thick, auto, node distance=7mm,
@@ -315,7 +321,7 @@
\end{center}
\noindent
-The $.$ stands for accepting any single character: for example if we
+Here, too, the $.$ stands for accepting any single character: for example if we
are in $Q_0$ and read an $a$ we can either stay in $Q_0$ (since any
character will do for this) or advance to $Q_1$ (but only if it is an
$a$). Why this is a good benchmark regular expression is irrelevant
@@ -342,25 +348,26 @@
underscore-pattern-matching. Recall that in $Q_0$ if we read an $a$ we
can go to $Q_1$ (by the first partial function in the set) and also
stay in $Q_0$ (by the second partial function). Representing such
-transitions in any other way is somehow awkward; the set of partial
-function representation makes them easy to implement.
+transitions in any other way in Scala seems somewhat awkward; the
+set-of-partial-functions representation makes them easy to implement.
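+For instance, the two transitions out of $Q_0$ could be written along
+the following lines (a sketch using integers as state names; the
+actual code in the figure may differ):
+
+{\small\begin{lstlisting}[language=Scala,linebackgroundcolor=
+{\ifodd\value{lstnumber}\color{capri!3}\fi}]
+// Q0 is 0 and Q1 is 1: the first partial function advances on an 'a',
+// the second stays in Q0 for any character (the underscore pattern)
+val q0Trans : Set[PartialFunction[(Int, Char), Int]] =
+  Set({ case (0, 'a') => 1 },
+      { case (0, _)   => 0 })
+\end{lstlisting}}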
Look very carefully again at the \texttt{accepts} and \texttt{deltas}
functions in NFAs and remember that when accepting a string by an NFA
-we might have to explore all possible transitions (which state to go
-to is not unique anymore). The shown implementation achieves this
-exploration in a \emph{breath-first search}. This is fine for very
-small NFAs, but can lead to problems when the NFAs are bigger. Take
-for example the regular expression $(.)^*\cdot a\cdot (.)^{\{n\}}\cdot
-b\cdot c$ from above. If $n$ is large, say 100 or 1000, then the
-corresponding NFA will have 104, respectively 1004, nodes. The problem
-is that with certain strings this can lead to 1000 ``active'' nodes
-which we need to analyse when determining the next states. This can be
-a real memory strain in practical applications. As a result, some
-regular expression matching engines resort to a \emph{depth-first
- search} with \emph{backtracking} in unsuccessful cases. In our
-implementation we could implement a depth-first version of
-\texttt{accepts} using Scala's \texttt{exists} as follows:
+we might have to explore all possible transitions (recall that with
+NFAs the next state is not unique anymore). The implementation
+achieves this exploration via a \emph{breadth-first search}. This is fine
+for very small NFAs, but can lead to problems when the NFAs are
+bigger. Take for example the regular expression $(.)^*\cdot a\cdot
+(.)^{\{n\}}\cdot b\cdot c$ from above. If $n$ is large, say 100 or
+1000, then the corresponding NFA will have 104, respectively 1004,
+nodes. The problem is that with certain strings this can lead to 1000
+``active'' nodes in the breadth-first search, all of which we need to
+analyse when determining the next states. This can be a real memory
+strain in practical applications. As a result, some regular expression
+matching engines resort to a \emph{depth-first search} with
+\emph{backtracking} in unsuccessful cases. In our implementation we
+can obtain a depth-first version of \texttt{accepts} using Scala's
+\texttt{exists} as follows:
{\small\begin{lstlisting}[language=Scala,linebackgroundcolor=
@@ -368,7 +375,7 @@
def search(q: A, s: List[C]) : Boolean = s match {
case Nil => fins(q)
case c::cs =>
- delta.exists((d) => Try(search(d(q, c), cs)) getOrElse false)
+ delta.exists(d => Try(search(d(q, c), cs)) getOrElse false)
}
def accepts(s: List[C]) : Boolean =
@@ -378,10 +385,11 @@
\noindent
This depth-first way of exploration seems to work efficiently in many
examples and is much less of a strain on memory. The problem is that the
-backtracking can get catastrophic in some examples---remember the
+backtracking can get ``catastrophic'' in some examples---remember the
catastrophic backtracking from earlier lectures. This depth-first
search with backtracking is the reason for the abysmal performance of
-regular expression macthing in Java, Ruby and Python.
+some regular expression matchers in Java, Ruby and Python. I would
+like to show you this next.
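+Just as a taster (this is my own toy example, not necessarily the one
+shown next): Scala strings use Java's regex engine, and a pattern with
+nested repetitions already exhibits this behaviour on short inputs.
+
+{\small\begin{lstlisting}[language=Scala,linebackgroundcolor=
+{\ifodd\value{lstnumber}\color{capri!3}\fi}]
+// a classic "evil" pattern: the nested stars force the backtracking
+// engine to try exponentially many ways of splitting up the a's
+val evil  = "(a*)*b"
+val input = "a" * 30        // no 'b', so the match is doomed to fail
+
+// input.matches(evil)      // beware: this can take a very long time
+\end{lstlisting}}
+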
%This means if
%we need to decide whether a string is accepted by a NFA, we might have
@@ -396,6 +404,13 @@
\subsubsection*{Thompson Construction}
+In order to get an idea of what calculations are done in Java \&
+friends, we need a method for translating regular expressions into
+automata. The simplest and most well-known method is called the
+\emph{Thompson Construction}, after the Turing Award winner Ken
+Thompson, who described this method in a paper from 1968.
+
+
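+As a reminder, regular expressions can be represented in Scala by a
+datatype along the following lines (a sketch; the constructor names
+follow the earlier handouts, but treat the details as assumptions
+here):
+
+{\small\begin{lstlisting}[language=Scala,linebackgroundcolor=
+{\ifodd\value{lstnumber}\color{capri!3}\fi}]
+abstract class Rexp
+case object ZERO extends Rexp                    // matches nothing
+case object ONE extends Rexp                     // matches the empty string
+case class CHAR(c: Char) extends Rexp            // matches a single character
+case class ALT(r1: Rexp, r2: Rexp) extends Rexp  // alternative
+case class SEQ(r1: Rexp, r2: Rexp) extends Rexp  // sequence
+case class STAR(r: Rexp) extends Rexp            // star
+\end{lstlisting}}
+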
The reason for introducing NFAs is that there is a relatively
simple (recursive) translation of regular expressions into
NFAs. Consider the simple regular expressions $\ZERO$,