# HG changeset patch # User Christian Urban # Date 1491862966 -28800 # Node ID 6f508bcdaa3035c731e2b83bdb1278390e736787 # Parent 0f6e3c5a1751b4127affd9eda5c33ac5d538d890 updated diff -r 0f6e3c5a1751 -r 6f508bcdaa30 handouts/ho03.pdf Binary file handouts/ho03.pdf has changed diff -r 0f6e3c5a1751 -r 6f508bcdaa30 handouts/ho03.tex --- a/handouts/ho03.tex Mon Apr 03 01:10:54 2017 +0800 +++ b/handouts/ho03.tex Tue Apr 11 06:22:46 2017 +0800 @@ -29,13 +29,14 @@ \subsection*{Deterministic Finite Automata} -Their definition is as follows:\medskip +The central definition is:\medskip \noindent A \emph{deterministic finite automaton} (DFA), say $A$, is -given by a four-tuple written $A(Qs, Q_0, F, \delta)$ where +given by a five-tuple written $A(\Sigma, Qs, Q_0, F, \delta)$ where \begin{itemize} +\item $\Sigma$ is an alphabet, \item $Qs$ is a finite set of states, \item $Q_0 \in Qs$ is the start state, \item $F \subseteq Qs$ are the accepting states, and @@ -47,8 +48,8 @@ the assumption that these transition functions do not need to be defined everywhere: so it can be the case that given a character there is no next state, in which case we need to raise a kind of ``failure -exception''. That means actually we have partial functions as -transitions---see our implementation later on. A typical example of a +exception''. That means actually we have \emph{partial} functions as +transitions---see the implementation later on. A typical example of a DFA is \begin{center} @@ -115,7 +116,7 @@ the way we hit the case where the transition function $\delta$ is not defined, we need to raise an error. In our implementation we will deal with this case elegantly by using Scala's \texttt{Try}. So a string -$s$ is in the \emph{language accepted by the automaton} $A(Q, Q_0, F, +$s$ is in the \emph{language accepted by the automaton} $A(\Sigma, Q, Q_0, F, \delta)$ iff \[ @@ -123,7 +124,7 @@ \] \noindent I let you think about a definition that describes -the set of strings accepted by an automaton. +the set of all strings accepted by an automaton. \begin{figure}[p] \small @@ -133,8 +134,8 @@ \caption{A Scala implementation of DFAs using partial functions. Notice some subtleties: \texttt{deltas} implements the delta-hat construction by lifting the transition (partial) function to - lists of \texttt{C}haracters. Since \texttt{delta} is given - as partial function, it can obviously go ``wrong'' in which + lists of characters. Since \texttt{delta} is given + as a partial function, it can obviously go ``wrong'' in which case the \texttt{Try} in \texttt{accepts} catches the error and returns \texttt{false}---that means the string is not accepted. The example \texttt{delta} implements the DFA example shown @@ -142,19 +143,25 @@ \end{figure} A simple Scala implementation for DFAs is given in -Figure~\ref{dfa}. As you can see, there many features of the +Figure~\ref{dfa}. As you can see, there are many features of the mathematical definition that are quite closely reflected in the code. In the DFA-class, there is a starting state, called -\texttt{start}, with the polymorphic type \texttt{A}. (The reason for -having a polymorphic types for the states and the input of DFAs will -become clearer later.) There is a partial function \texttt{delta} for -specifying the transitions. I used partial functions for representing -the transitions in Scala; alternatives would have been \texttt{Maps} -or even \texttt{Lists}. One of the main advantages of using partial -functions is that transitions can be quite nicely defined by a series -of \texttt{case} statements (see Lines 28 -- 38). If you need to -represent an automaton with a sink state (catch-all-state), you can -write something like +\texttt{start}, with the polymorphic type \texttt{A}. There is a +partial function \texttt{delta} for specifying the transitions---these +partial functions take a state (of polymorphic type \texttt{A}) and an +input (of polymorphic type \texttt{C}) and produce a new state (of +type \texttt{A}). For the moment it is OK to assume that \texttt{A} is +some arbitrary type for states and the input is just characters. (The +reason for having a polymorphic types for the states and the input of +DFAs will become clearer later on.) + +I used partial functions for representing the transitions in Scala; +alternatives would have been \texttt{Maps} or even \texttt{Lists}. One +of the main advantages of using partial functions is that transitions +can be quite nicely defined by a series of \texttt{case} statements +(see Lines 28 -- 38 for an example). If you need to represent an +automaton with a sink state (catch-all-state), you can use Scala's +pattern matching and write something like {\small\begin{lstlisting}[language=Scala,linebackgroundcolor= {\ifodd\value{lstnumber}\color{capri!3}\fi}] @@ -170,11 +177,13 @@ \end{lstlisting}} \noindent The DFA-class has also an argument for specifying final -states. In the implementation it is a function from states to booleans -(returns true whenever a state is meant to be final; false -otherwise). While this boolean function is different from the sets of -states used in the mathematical definition, Scala allows me to use -sets for such functions (see Line 40 where I initialise the DFA). +states. In the implementation it not a set of states, as in the +matemathical definition, but is a function from states to booleans +(this function is supposed to return true whenever a state is meant to +be final; false otherwise). While this boolean function is different +from the sets of states, Scala allows to use sets for such functions +(see Line 40 where the DFA is initialised). Again it will become clear +later on why I use functions for final states, rather than sets. I let you ponder whether this is a good implementation of DFAs. In doing so I hope you notice that the $Qs$ component (the set of finite @@ -191,10 +200,11 @@ to relax this restriction. That means we allow to have several potential successor states. We even allow more than one starting state. The resulting construction is called a \emph{non-deterministic - finite automaton} (NFA) given also as a four-tuple $A(Qs, Q_{0s}, F, + finite automaton} (NFA) given also as a five-tuple $A(\Sigma, Qs, Q_{0s}, F, \rho)$ where \begin{itemize} +\item $\Sigma$ is an alphabet, \item $Qs$ is a finite set of states \item $Q_{0s}$ is a set of start states ($Q_{0s} \subseteq Qs$) \item $F$ are some accepting states with $F \subseteq Qs$, and @@ -202,12 +212,13 @@ \end{itemize} \noindent -A typical example of an NFA is +A typical example of a NFA is % A NFA for (ab* + b)*a \begin{center} -\begin{tikzpicture}[scale=0.8,>=stealth',very thick, -every state/.style={minimum size=0pt,draw=blue!50,very thick,fill=blue!20},] +\begin{tikzpicture}[>=stealth',very thick, auto, + every state/.style={minimum size=0pt,inner sep=3pt, + draw=blue!50,very thick,fill=blue!20},scale=2] \node[state,initial] (Q_0) {$Q_0$}; \node[state] (Q_1) [right=of Q_0] {$Q_1$}; \node[state, accepting] (Q_2) [right=of Q_1] {$Q_2$}; @@ -223,40 +234,45 @@ \noindent This NFA happens to have only one starting state, but in general there could be more. Notice that in state $Q_0$ we might go to state $Q_1$ -\emph{or} to state $Q_2$ when receiving an $a$. Similarly in state $Q_1$ -and receiving an $a$, we can stay in state $Q_1$ or go to $Q_2$. This -kind of choice is not allowed with DFAs. When it comes to deciding -whether a string is accepted by an NFA we potentially need to explore -all possibilities. I let you think which kind of strings this NFA -accepts. +\emph{or} to state $Q_2$ when receiving an $a$. Similarly in state +$Q_1$ and receiving an $a$, we can stay in state $Q_1$ or go to +$Q_2$. This kind of choice is not allowed with DFAs. The downside is +that when it comes to deciding whether a string is accepted by a NFA +we potentially have to explore all possibilities. I let you think +which kind of strings the above NFA accepts. -There are however a number of additional points you should note. Every -DFA is a NFA, but not vice versa. The $\rho$ in NFAs is a transition -\emph{relation} (DFAs have a transition function). The difference -between a function and a relation is that a function has always a -single output, while a relation gives, roughly speaking, several -outputs. Look again at the NFA above: if you are currently in the -state $Q_1$ and you read a character $b$, then you can transition to -either $Q_0$ \emph{or} $Q_2$. Which route, or output, you take is not -determined. This non-determinism can be represented by a relation. +There are a number of additional points you should note with +NFAs. Every DFA is a NFA, but not vice versa. The $\rho$ in NFAs is a +transition \emph{relation} (DFAs have a transition function). The +difference between a function and a relation is that a function has +always a single output, while a relation gives, roughly speaking, +several outputs. Look again at the NFA above: if you are currently in +the state $Q_1$ and you read a character $b$, then you can transition +to either $Q_0$ \emph{or} $Q_2$. Which route, or output, you take is +not determined. This non-determinism can be represented by a +relation. -My implementation of NFAs is shown in Figure~\ref{nfa}. Perhaps -interestingly, I do not use relations for my implementation of NFAs in -Scala, and I also do not use transition functions which return a set -of states (another popular choice for implementing NFAs). For reasons -that become clear in a moment, I use sets of partial functions---see -Line 7 in Figure~\ref{nfa}. \texttt{Starts} is now a set of states; -\texttt{fins} is again a function from states to booleans. The -\texttt{next} function calculates the set of next states reachable -from a state \texttt{q} by a character \texttt{c}---we go through all -partial functions in the \texttt{delta}-set, lift it (this transformes -the partial function into the corresponding \texttt{Option}-function -and then apply \texttt{q} and \texttt{c}. This gives us a set of -\texttt{Some}s and \texttt{None}s, where the \texttt{Some}s are -filtered out by the \texttt{flatMap}. Teh function \texttt{nexts} just -lifts this to sets of states. \texttt{Deltas} and \texttt{accept} are -similar to the DFA definitions. +My implementation of NFAs in Scala is shown in Figure~\ref{nfa}. +Perhaps interestingly, I do not actually use relations for my NFAs, +and I also do not use transition functions that return sets of states +(another popular choice for implementing NFAs). For reasons that +become clear in a moment, I use sets of partial functions +instead---see Line 7 in Figure~\ref{nfa}. DFAs have only one such +partial function; my NFAs have a set. Another parameter, +\texttt{Starts}, is in NFAs a set of states; \texttt{fins} is again a +function from states to booleans. The \texttt{next} function +calculates the set of next states reachable from a single state +\texttt{q} by a character \texttt{c}---this is calculated by going +through all the partial functions in the \texttt{delta}-set and apply +\texttt{q} and \texttt{c} (see Line 13). This gives us a set of +\texttt{Some}s (in case the application succeeded) and possibly some +\texttt{None}s (in case the partial function is not defined or produces an +error). The \texttt{None}s are filtered out by the \texttt{flatMap}, +leaving the values inside the \texttt{Some}s. The function +\texttt{nexts} just lifts this function to sets of +states. \texttt{Deltas} and \texttt{accept} are similar to the DFA +definitions. \begin{figure}[t] \small @@ -264,17 +280,108 @@ {\ifodd\value{lstnumber}\color{capri!3}\fi}] {../progs/nfa.scala} \caption{A Scala implementation of NFAs using sets of partial functions. - Notice some subtleties: \texttt{deltas} implements the delta-hat - construction by lifting the transition (partial) function to - lists of \texttt{C}haracters. Since \texttt{delta} is given - as partial function, it can obviously go ``wrong'' in which - case the \texttt{Try} in \texttt{accepts} catches the error and - returns \texttt{false}---that means the string is not accepted. - The example \texttt{delta} implements the DFA example shown - earlier in the handout.\label{nfa}} + Notice some subtleties: Since \texttt{delta} is given + as a set of partial functions, each of them can obviously go ``wrong'' in which + case the \texttt{Try}. The function \texttt{accepts} implements the + acceptance of a string in a breath-first fashion. This can be costly + way of deciding whether a string is accepted in practical contexts.\label{nfa}} \end{figure} +The reason for using sets of partial functions for specifying the +transitions in NFAs has to do with examples like this one: a +popular benchmark regular expression is $(.)^*\cdot a\cdot +(.)^{\{n\}}\cdot b\cdot c$. A NFA that accepts the same strings +(for $n=3$) is as follows: +\begin{center} +\begin{tikzpicture}[>=stealth',very thick, auto, node distance=7mm, + every state/.style={minimum size=0pt,inner sep=1pt, + draw=blue!50,very thick,fill=blue!20},scale=0.5] +\node[state,initial] (Q_0) {$Q_0$}; +\node[state] (Q_1) [right=of Q_0] {$Q_1$}; +\node[state] (Q_2) [right=of Q_1] {$Q_2$}; +\node[state] (Q_3) [right=of Q_2] {$Q_3$}; +\node[state] (Q_4) [right=of Q_3] {$Q_4$}; +\node[state] (Q_5) [right=of Q_4] {$Q_5$}; +\node[state,accepting] (Q_6) [right=of Q_5] {$Q_6$}; +\path[->] (Q_0) edge [loop above] node {$.$} (); +\path[->] (Q_0) edge node [above] {$a$} (Q_1); +\path[->] (Q_1) edge node [above] {$.$} (Q_2); +\path[->] (Q_2) edge node [above] {$.$} (Q_3); +\path[->] (Q_3) edge node [above] {$.$} (Q_4); +\path[->] (Q_4) edge node [above] {$b$} (Q_5); +\path[->] (Q_5) edge node [above] {$c$} (Q_6); +\end{tikzpicture} +\end{center} + +\noindent +The $.$ stands for accepting any single character: for example if we +are in $Q_0$ and read an $a$ we can either stay in $Q_0$ (since any +character will do for this) or advance to $Q_1$ (but only if it is an +$a$). Why this is a good benchmark regular expression is irrelevant +here. The point is that this NFA can be conveniently represented by +the code: + +{\small\begin{lstlisting}[language=Scala,linebackgroundcolor= + {\ifodd\value{lstnumber}\color{capri!3}\fi}] +val delta = Set[(State, Char) :=> State]( + { case (Q0, 'a') => Q1 }, + { case (Q0, _) => Q0 }, + { case (Q1, _) => Q2 }, + { case (Q2, _) => Q3 }, + { case (Q3, _) => Q4 }, + { case (Q4, 'b') => Q5 }, + { case (Q5, 'c') => Q6 } +) + +NFA(Set[State](Q0), delta, Set[State](Q6)) +\end{lstlisting}} + +\noindent +where the $.$-transitions translate into a +underscore-pattern-matching. Recall that in $Q_0$ if we read an $a$ we +can go to $Q_1$ (by the first partial function in the set) and also +stay in $Q_0$ (by the second partial function). Representing such +transitions in any other way is somehow awkward; the set of partial +function representation makes them easy to implement. + +Look very careful again at the \texttt{accepts} and \texttt{deltas} +functions in NFAs and remember that when accepting a string by an NFA +we might have to explore all possible transitions (which state to go +to is not unique anymore). The shown implementation achieves this +exploration in a \emph{breath-first search}. This is fine for very +small NFAs, but can lead to problems when the NFAs are bigger. Take +for example the regular expression $(.)^*\cdot a\cdot (.)^{\{n\}}\cdot +b\cdot c$ from above. If $n$ is large, say 100 or 1000, then the +corresponding NFA will have 104, respectively 1004, nodes. The problem +is that with certain strings this can lead to 1000 ``active'' nodes +which we need to analyse when determining the next states. This can be +a real memory strain in practical applications. As a result, some +regular expression matching engines resort to a \emph{depth-first + search} with \emph{backtracking} in unsuccessful cases. In our +implementation we could implement a depth-first version of +\texttt{accepts} using Scala's \texttt{exists} as follows: + + +{\small\begin{lstlisting}[language=Scala,linebackgroundcolor= + {\ifodd\value{lstnumber}\color{capri!3}\fi}] +def search(q: A, s: List[C]) : Boolean = s match { + case Nil => fins(q) + case c::cs => + delta.exists((d) => Try(search(d(q, c), cs)) getOrElse false) +} + +def accepts(s: List[C]) : Boolean = + starts.exists(search(_, s)) +\end{lstlisting}} + +\noindent +This depth-first way of exploration seems to work efficiently in many +examples and is much less of strain on memory. The problem is that the +backtracking can get catastrophic in some examples---remember the +catastrophic backtracking from earlier lectures. This depth-first +search with backtracking is the reason for the abysmal performance of +regular expression macthing in Java, Ruby and Python. %This means if %we need to decide whether a string is accepted by a NFA, we might have @@ -780,7 +887,7 @@ \subsubsection*{Automata Minimization} As seen in the subset construction, the translation -of an NFA to a DFA can result in a rather ``inefficient'' +of a NFA to a DFA can result in a rather ``inefficient'' DFA. Meaning there are states that are not needed. A DFA can be \emph{minimised} by the following algorithm: @@ -974,7 +1081,7 @@ or not is computationally not cheap. Remember with NFAs we have potentially many next states even for the same input and also have the silent $\epsilon$-transitions. If we want to -find a path from the starting state of an NFA to an accepting +find a path from the starting state of a NFA to an accepting state, we need to consider all possibilities. In Ruby and Python this is done by a depth-first search, which in turn means that if a ``wrong'' choice is made, the algorithm has to @@ -1008,7 +1115,7 @@ for this is as follows: take a regular expression, translate it into a NFA and then a DFA that both recognise the same language. Once you have the DFA it is very easy to construct -the automaton for the language not recognised by an DFA. If +the automaton for the language not recognised by a DFA. If the DFA is completed (this is important!), then you just need to exchange the accepting and non-accepting states. You can then translate this DFA back into a regular expression and diff -r 0f6e3c5a1751 -r 6f508bcdaa30 progs/dfa.scala --- a/progs/dfa.scala Mon Apr 03 01:10:54 2017 +0800 +++ b/progs/dfa.scala Tue Apr 11 06:22:46 2017 +0800 @@ -4,9 +4,9 @@ // type abbreviation for partial functions type :=>[A, B] = PartialFunction[A, B] -case class DFA[A, C](start: A, // starting state - delta: (A, C) :=> A, // transition partial fun - fins: A => Boolean) { // final states +case class DFA[A, C](start: A, // starting state + delta: (A, C) :=> A, // transition (partial fun) + fins: A => Boolean) { // final states def deltas(q: A, s: List[C]) : A = s match { case Nil => q diff -r 0f6e3c5a1751 -r 6f508bcdaa30 progs/nfa.scala --- a/progs/nfa.scala Mon Apr 03 01:10:54 2017 +0800 +++ b/progs/nfa.scala Tue Apr 11 06:22:46 2017 +0800 @@ -10,7 +10,7 @@ // given a state and a character, what is the set of next states? // if there is none => empty set def next(q: A, c: C) : Set[A] = - delta.flatMap(_.lift.apply(q, c)) + delta.flatMap(d => Try(d(q, c)).toOption) def nexts(qs: Set[A], c: C) : Set[A] = qs.flatMap(next(_, c))