113 end up in an accepting state, then this string is accepted by the  | 
   114 end up in an accepting state, then this string is accepted by the  | 
   114 automaton. Otherwise it is not accepted. This also means that if along  | 
   115 automaton. Otherwise it is not accepted. This also means that if along  | 
   115 the way we hit the case where the transition function $\delta$ is not  | 
   116 the way we hit the case where the transition function $\delta$ is not  | 
   116 defined, we need to raise an error. In our implementation we will deal  | 
   117 defined, we need to raise an error. In our implementation we will deal  | 
   117 with this case elegantly by using Scala's \texttt{Try}. So a string | 
   118 with this case elegantly by using Scala's \texttt{Try}. So a string | 
   118 $s$ is in the \emph{language accepted by the automaton} $A(Q, Q_0, F, | 
   119 $s$ is in the \emph{language accepted by the automaton} $A(\Sigma, Q, Q_0, F, | 
   119 \delta)$ iff  | 
   120 \delta)$ iff  | 
   120   | 
   121   | 
   121 \[  | 
   122 \[  | 
   122 \hat{\delta}(Q_0, s) \in F  | 
   123 \hat{\delta}(Q_0, s) \in F  | 
   123 \]  | 
   124 \]  | 
   124   | 
   125   | 
   125 \noindent I let you think about a definition that describes  | 
   126 \noindent I let you think about a definition that describes  | 
   126 the set of strings accepted by an automaton.   | 
   127 the set of all strings accepted by an automaton.   | 
   127   | 
   128   | 
   128 \begin{figure}[p] | 
   129 \begin{figure}[p] | 
   129 \small  | 
   130 \small  | 
   130 \lstinputlisting[numbers=left,linebackgroundcolor=  | 
   131 \lstinputlisting[numbers=left,linebackgroundcolor=  | 
   131                   {\ifodd\value{lstnumber}\color{capri!3}\fi}] | 
   132                   {\ifodd\value{lstnumber}\color{capri!3}\fi}] | 
   132                   {../progs/dfa.scala} | 
   133                   {../progs/dfa.scala} | 
   133 \caption{A Scala implementation of DFAs using partial functions. | 
   134 \caption{A Scala implementation of DFAs using partial functions. | 
   134   Notice some subtleties: \texttt{deltas} implements the delta-hat | 
   135   Notice some subtleties: \texttt{deltas} implements the delta-hat | 
   135   construction by lifting the transition (partial) function to  | 
   136   construction by lifting the transition (partial) function to  | 
   136   lists of \texttt{C}haracters. Since \texttt{delta} is given | 
   137   lists of characters. Since \texttt{delta} is given | 
   137   as partial function, it can obviously go ``wrong'' in which  | 
   138   as a partial function, it can obviously go ``wrong'' in which  | 
   138   case the \texttt{Try} in \texttt{accepts} catches the error and | 
   139   case the \texttt{Try} in \texttt{accepts} catches the error and | 
   139   returns \texttt{false}---that means the string is not accepted. | 
   140   returns \texttt{false}---that means the string is not accepted. | 
   140   The example \texttt{delta} implements the DFA example shown | 
   141   The example \texttt{delta} implements the DFA example shown | 
   141   earlier in the handout.\label{dfa}} | 
   142   earlier in the handout.\label{dfa}} | 
   142 \end{figure} | 
   143 \end{figure} | 
   143   | 
   144   | 
   144 A simple Scala implementation for DFAs is given in  | 
   145 A simple Scala implementation for DFAs is given in  | 
   145 Figure~\ref{dfa}. As you can see, there many features of the | 
   146 Figure~\ref{dfa}. As you can see, there are many features of the | 
   146 mathematical definition that are quite closely reflected in the  | 
   147 mathematical definition that are quite closely reflected in the  | 
   147 code. In the DFA-class, there is a starting state, called  | 
   148 code. In the DFA-class, there is a starting state, called  | 
   148 \texttt{start}, with the polymorphic type \texttt{A}. (The reason for | 
   149 \texttt{start}, with the polymorphic type \texttt{A}.  There is a | 
   149 having a polymorphic types for the states and the input of DFAs will  | 
   150 partial function \texttt{delta} for specifying the transitions---these | 
   150 become clearer later.)  There is a partial function \texttt{delta} for | 
   151 partial functions take a state (of polymorphic type \texttt{A}) and an | 
   151 specifying the transitions. I used partial functions for representing  | 
   152 input (of polymorphic type \texttt{C}) and produce a new state (of | 
   152 the transitions in Scala; alternatives would have been \texttt{Maps} | 
   153 type \texttt{A}). For the moment it is OK to assume that \texttt{A} is | 
   153 or even \texttt{Lists}. One of the main advantages of using partial | 
   154 some arbitrary type for states and the input is just characters.  (The  | 
   154 functions is that transitions can be quite nicely defined by a series  | 
   155 reason for having a polymorphic types for the states and the input of  | 
   155 of \texttt{case} statements (see Lines 28 -- 38). If you need to | 
   156 DFAs will become clearer later on.)  | 
   156 represent an automaton with a sink state (catch-all-state), you can  | 
   157   | 
   157 write something like  | 
   158 I used partial functions for representing the transitions in Scala;  | 
         | 
   159 alternatives would have been \texttt{Maps} or even \texttt{Lists}. One | 
         | 
   160 of the main advantages of using partial functions is that transitions  | 
         | 
   161 can be quite nicely defined by a series of \texttt{case} statements | 
         | 
   162 (see Lines 28 -- 38 for an example). If you need to represent an  | 
         | 
   163 automaton with a sink state (catch-all-state), you can use Scala's  | 
         | 
   164 pattern matching and write something like  | 
   158   | 
   165   | 
   159 {\small\begin{lstlisting}[language=Scala,linebackgroundcolor= | 
   166 {\small\begin{lstlisting}[language=Scala,linebackgroundcolor= | 
   160                   {\ifodd\value{lstnumber}\color{capri!3}\fi}] | 
   167                   {\ifodd\value{lstnumber}\color{capri!3}\fi}] | 
   161 abstract class State  | 
   168 abstract class State  | 
   162 ...  | 
   169 ...  | 
   189 While with DFAs it will always be clear that given a state and a  | 
   198 While with DFAs it will always be clear that given a state and a  | 
   190 character what the next state is (potentially none), it will be useful  | 
   199 character what the next state is (potentially none), it will be useful  | 
   191 to relax this restriction. That means we allow to have several  | 
   200 to relax this restriction. That means we allow to have several  | 
   192 potential successor states. We even allow more than one starting  | 
   201 potential successor states. We even allow more than one starting  | 
   193 state. The resulting construction is called a \emph{non-deterministic | 
   202 state. The resulting construction is called a \emph{non-deterministic | 
   194   finite automaton} (NFA) given also as a four-tuple $A(Qs, Q_{0s}, F, | 
   203   finite automaton} (NFA) given also as a five-tuple $A(\Sigma, Qs, Q_{0s}, F, | 
   195 \rho)$ where  | 
   204 \rho)$ where  | 
   196   | 
   205   | 
   197 \begin{itemize} | 
   206 \begin{itemize} | 
         | 
   207 \item $\Sigma$ is an alphabet,    | 
   198 \item $Qs$ is a finite set of states  | 
   208 \item $Qs$ is a finite set of states  | 
   199 \item $Q_{0s}$ is a set of start states ($Q_{0s} \subseteq Qs$) | 
   209 \item $Q_{0s}$ is a set of start states ($Q_{0s} \subseteq Qs$) | 
   200 \item $F$ are some accepting states with $F \subseteq Qs$, and  | 
   210 \item $F$ are some accepting states with $F \subseteq Qs$, and  | 
   201 \item $\rho$ is a transition relation.  | 
   211 \item $\rho$ is a transition relation.  | 
   202 \end{itemize} | 
   212 \end{itemize} | 
   203   | 
   213   | 
   204 \noindent  | 
   214 \noindent  | 
   205 A typical example of an NFA is  | 
   215 A typical example of a NFA is  | 
   206   | 
   216   | 
   207 % A NFA for (ab* + b)*a  | 
   217 % A NFA for (ab* + b)*a  | 
   208 \begin{center} | 
   218 \begin{center} | 
   209 \begin{tikzpicture}[scale=0.8,>=stealth',very thick, | 
   219 \begin{tikzpicture}[>=stealth',very thick, auto, | 
   210 every state/.style={minimum size=0pt,draw=blue!50,very thick,fill=blue!20},] | 
   220     every state/.style={minimum size=0pt,inner sep=3pt, | 
         | 
   221       draw=blue!50,very thick,fill=blue!20},scale=2]  | 
   211 \node[state,initial]  (Q_0)  {$Q_0$}; | 
   222 \node[state,initial]  (Q_0)  {$Q_0$}; | 
   212 \node[state] (Q_1) [right=of Q_0] {$Q_1$}; | 
   223 \node[state] (Q_1) [right=of Q_0] {$Q_1$}; | 
   213 \node[state, accepting] (Q_2) [right=of Q_1] {$Q_2$}; | 
   224 \node[state, accepting] (Q_2) [right=of Q_1] {$Q_2$}; | 
   214 \path[->] (Q_0) edge [loop above] node  {$b$} (); | 
   225 \path[->] (Q_0) edge [loop above] node  {$b$} (); | 
   215 \path[<-] (Q_0) edge node [below]  {$b$} (Q_1); | 
   226 \path[<-] (Q_0) edge node [below]  {$b$} (Q_1); | 
   221 \end{center} | 
   232 \end{center} | 
   222   | 
   233   | 
   223 \noindent  | 
   234 \noindent  | 
   224 This NFA happens to have only one starting state, but in general there  | 
   235 This NFA happens to have only one starting state, but in general there  | 
   225 could be more.  Notice that in state $Q_0$ we might go to state $Q_1$  | 
   236 could be more.  Notice that in state $Q_0$ we might go to state $Q_1$  | 
   226 \emph{or} to state $Q_2$ when receiving an $a$. Similarly in state $Q_1$ | 
   237 \emph{or} to state $Q_2$ when receiving an $a$. Similarly in state | 
   227 and receiving an $a$, we can stay in state $Q_1$ or go to $Q_2$. This  | 
   238 $Q_1$ and receiving an $a$, we can stay in state $Q_1$ or go to  | 
   228 kind of choice is not allowed with DFAs. When it comes to deciding  | 
   239 $Q_2$. This kind of choice is not allowed with DFAs. The downside is  | 
   229 whether a string is accepted by an NFA we potentially need to explore  | 
   240 that when it comes to deciding whether a string is accepted by a NFA  | 
   230 all possibilities. I let you think which kind of strings this NFA  | 
   241 we potentially have to explore all possibilities. I let you think  | 
   231 accepts.  | 
   242 which kind of strings the above NFA accepts.  | 
   232   | 
   243   | 
   233   | 
   244   | 
   234 There are however a number of additional points you should note. Every  | 
   245 There are a number of additional points you should note with  | 
   235 DFA is a NFA, but not vice versa. The $\rho$ in NFAs is a transition  | 
   246 NFAs. Every DFA is a NFA, but not vice versa. The $\rho$ in NFAs is a  | 
   236 \emph{relation} (DFAs have a transition function). The difference | 
   247 transition \emph{relation} (DFAs have a transition function). The | 
   237 between a function and a relation is that a function has always a  | 
   248 difference between a function and a relation is that a function has  | 
   238 single output, while a relation gives, roughly speaking, several  | 
   249 always a single output, while a relation gives, roughly speaking,  | 
   239 outputs. Look again at the NFA above: if you are currently in the  | 
   250 several outputs. Look again at the NFA above: if you are currently in  | 
   240 state $Q_1$ and you read a character $b$, then you can transition to  | 
   251 the state $Q_1$ and you read a character $b$, then you can transition  | 
   241 either $Q_0$ \emph{or} $Q_2$. Which route, or output, you take is not | 
   252 to either $Q_0$ \emph{or} $Q_2$. Which route, or output, you take is | 
   242 determined.  This non-determinism can be represented by a relation.  | 
   253 not determined.  This non-determinism can be represented by a  | 
   243   | 
   254 relation.  | 
   244 My implementation of NFAs is shown in Figure~\ref{nfa}.  Perhaps | 
   255   | 
   245 interestingly, I do not use relations for my implementation of NFAs in  | 
   256 My implementation of NFAs in Scala is shown in Figure~\ref{nfa}. | 
   246 Scala, and I also do not use transition functions which return a set  | 
   257 Perhaps interestingly, I do not actually use relations for my NFAs,  | 
   247 of states (another popular choice for implementing NFAs).  For reasons  | 
   258 and I also do not use transition functions that return sets of states  | 
   248 that become clear in a moment, I use sets of partial functions---see  | 
   259 (another popular choice for implementing NFAs).  For reasons that  | 
   249 Line 7 in Figure~\ref{nfa}. \texttt{Starts} is now a set of states; | 
   260 become clear in a moment, I use sets of partial functions  | 
   250 \texttt{fins} is again a function from states to booleans. The | 
   261 instead---see Line 7 in Figure~\ref{nfa}. DFAs have only one such | 
   251 \texttt{next} function calculates the set of next states reachable | 
   262 partial function; my NFAs have a set.  Another parameter,  | 
   252 from a state \texttt{q} by a character \texttt{c}---we go through all | 
   263 \texttt{Starts}, is in NFAs a set of states; \texttt{fins} is again a | 
   253 partial functions in the \texttt{delta}-set, lift it (this transformes | 
   264 function from states to booleans. The \texttt{next} function | 
   254 the partial function into the corresponding \texttt{Option}-function | 
   265 calculates the set of next states reachable from a single state  | 
   255 and then apply \texttt{q} and \texttt{c}. This gives us a set of | 
   266 \texttt{q} by a character \texttt{c}---this is calculated by going | 
   256 \texttt{Some}s and \texttt{None}s, where the \texttt{Some}s are | 
   267 through all the partial functions in the \texttt{delta}-set and apply | 
   257 filtered out by the \texttt{flatMap}. Teh function \texttt{nexts} just | 
   268 \texttt{q} and \texttt{c} (see Line 13). This gives us a set of | 
   258 lifts this to sets of states. \texttt{Deltas} and \texttt{accept} are | 
   269 \texttt{Some}s (in case the application succeeded) and possibly some | 
   259 similar to the DFA definitions.  | 
   270 \texttt{None}s (in case the partial function is not defined or produces an | 
         | 
   271 error).  The \texttt{None}s are filtered out by the \texttt{flatMap}, | 
         | 
   272 leaving the values inside the \texttt{Some}s. The function | 
         | 
   273 \texttt{nexts} just lifts this function to sets of | 
         | 
   274 states. \texttt{Deltas} and \texttt{accept} are similar to the DFA | 
         | 
   275 definitions.  | 
   260   | 
   276   | 
   261 \begin{figure}[t] | 
   277 \begin{figure}[t] | 
   262 \small  | 
   278 \small  | 
   263 \lstinputlisting[numbers=left,linebackgroundcolor=  | 
   279 \lstinputlisting[numbers=left,linebackgroundcolor=  | 
   264                   {\ifodd\value{lstnumber}\color{capri!3}\fi}] | 
   280                   {\ifodd\value{lstnumber}\color{capri!3}\fi}] | 
   265                   {../progs/nfa.scala} | 
   281                   {../progs/nfa.scala} | 
   266 \caption{A Scala implementation of NFAs using sets of partial functions. | 
   282 \caption{A Scala implementation of NFAs using sets of partial functions. | 
   267   Notice some subtleties: \texttt{deltas} implements the delta-hat | 
   283   Notice some subtleties: Since \texttt{delta} is given | 
   268   construction by lifting the transition (partial) function to  | 
   284   as a set of partial functions, each of them can obviously go ``wrong'' in which  | 
   269   lists of \texttt{C}haracters. Since \texttt{delta} is given | 
   285   case the \texttt{Try}. The function \texttt{accepts} implements the | 
   270   as partial function, it can obviously go ``wrong'' in which  | 
   286   acceptance of a string in a breath-first fashion. This can be costly  | 
   271   case the \texttt{Try} in \texttt{accepts} catches the error and | 
   287   way of deciding whether a string is accepted in practical contexts.\label{nfa}} | 
   272   returns \texttt{false}---that means the string is not accepted. | 
         | 
   273   The example \texttt{delta} implements the DFA example shown | 
         | 
   274   earlier in the handout.\label{nfa}} | 
         | 
   275 \end{figure} | 
   288 \end{figure} | 
   276   | 
   289   | 
   277   | 
   290 The reason for using sets of partial functions for specifying the  | 
         | 
   291 transitions in NFAs has to do with examples like this one: a  | 
         | 
   292 popular benchmark regular expression is $(.)^*\cdot a\cdot  | 
         | 
   293 (.)^{\{n\}}\cdot b\cdot c$. A NFA that accepts the same strings | 
         | 
   294 (for $n=3$) is as follows:  | 
         | 
   295   | 
         | 
   296 \begin{center} | 
         | 
   297 \begin{tikzpicture}[>=stealth',very thick, auto, node distance=7mm, | 
         | 
   298     every state/.style={minimum size=0pt,inner sep=1pt, | 
         | 
   299       draw=blue!50,very thick,fill=blue!20},scale=0.5]  | 
         | 
   300 \node[state,initial]  (Q_0)  {$Q_0$}; | 
         | 
   301 \node[state] (Q_1) [right=of Q_0] {$Q_1$}; | 
         | 
   302 \node[state] (Q_2) [right=of Q_1] {$Q_2$}; | 
         | 
   303 \node[state] (Q_3) [right=of Q_2] {$Q_3$}; | 
         | 
   304 \node[state] (Q_4) [right=of Q_3] {$Q_4$}; | 
         | 
   305 \node[state] (Q_5) [right=of Q_4] {$Q_5$}; | 
         | 
   306 \node[state,accepting] (Q_6) [right=of Q_5] {$Q_6$}; | 
         | 
   307 \path[->] (Q_0) edge [loop above] node  {$.$} (); | 
         | 
   308 \path[->] (Q_0) edge node [above]  {$a$} (Q_1); | 
         | 
   309 \path[->] (Q_1) edge node [above]  {$.$} (Q_2); | 
         | 
   310 \path[->] (Q_2) edge node [above]  {$.$} (Q_3); | 
         | 
   311 \path[->] (Q_3) edge node [above]  {$.$} (Q_4); | 
         | 
   312 \path[->] (Q_4) edge node [above]  {$b$} (Q_5); | 
         | 
   313 \path[->] (Q_5) edge node [above]  {$c$} (Q_6); | 
         | 
   314 \end{tikzpicture} | 
         | 
   315 \end{center} | 
         | 
   316   | 
         | 
   317 \noindent  | 
         | 
   318 The $.$ stands for accepting any single character: for example if we  | 
         | 
   319 are in $Q_0$ and read an $a$ we can either stay in $Q_0$ (since any  | 
         | 
   320 character will do for this) or advance to $Q_1$ (but only if it is an  | 
         | 
   321 $a$). Why this is a good benchmark regular expression is irrelevant  | 
         | 
   322 here. The point is that this NFA can be conveniently represented by  | 
         | 
   323 the code:  | 
         | 
   324   | 
         | 
   325 {\small\begin{lstlisting}[language=Scala,linebackgroundcolor= | 
         | 
   326                   {\ifodd\value{lstnumber}\color{capri!3}\fi}] | 
         | 
   327 val delta = Set[(State, Char) :=> State](  | 
         | 
   328   { case (Q0, 'a') => Q1 }, | 
         | 
   329   { case (Q0, _)   => Q0 }, | 
         | 
   330   { case (Q1, _)   => Q2 }, | 
         | 
   331   { case (Q2, _)   => Q3 }, | 
         | 
   332   { case (Q3, _)   => Q4 }, | 
         | 
   333   { case (Q4, 'b') => Q5 }, | 
         | 
   334   { case (Q5, 'c') => Q6 } | 
         | 
   335 )  | 
         | 
   336   | 
         | 
   337 NFA(Set[State](Q0), delta, Set[State](Q6))  | 
         | 
   338 \end{lstlisting}} | 
         | 
   339   | 
         | 
   340 \noindent  | 
         | 
   341 where the $.$-transitions translate into a  | 
         | 
   342 underscore-pattern-matching. Recall that in $Q_0$ if we read an $a$ we  | 
         | 
   343 can go to $Q_1$ (by the first partial function in the set) and also  | 
         | 
   344 stay in $Q_0$ (by the second partial function). Representing such  | 
         | 
   345 transitions in any other way is somehow awkward; the set of partial  | 
         | 
   346 function representation makes them easy to implement.  | 
         | 
   347   | 
         | 
   348 Look very careful again at the \texttt{accepts} and \texttt{deltas} | 
         | 
   349 functions in NFAs and remember that when accepting a string by an NFA  | 
         | 
   350 we might have to explore all possible transitions (which state to go  | 
         | 
   351 to is not unique anymore). The shown implementation achieves this  | 
         | 
   352 exploration in a \emph{breath-first search}. This is fine for very | 
         | 
   353 small NFAs, but can lead to problems when the NFAs are bigger. Take  | 
         | 
   354 for example the regular expression $(.)^*\cdot a\cdot (.)^{\{n\}}\cdot | 
         | 
   355 b\cdot c$ from above. If $n$ is large, say 100 or 1000, then the  | 
         | 
   356 corresponding NFA will have 104, respectively 1004, nodes. The problem  | 
         | 
   357 is that with certain strings this can lead to 1000 ``active'' nodes  | 
         | 
   358 which we need to analyse when determining the next states. This can be  | 
         | 
   359 a real memory strain in practical applications. As a result, some  | 
         | 
   360 regular expression matching engines resort to a \emph{depth-first | 
         | 
   361   search} with \emph{backtracking} in unsuccessful cases. In our | 
         | 
   362 implementation we could implement a depth-first version of  | 
         | 
   363 \texttt{accepts} using Scala's \texttt{exists} as follows: | 
         | 
   364   | 
         | 
   365   | 
         | 
   366 {\small\begin{lstlisting}[language=Scala,linebackgroundcolor= | 
         | 
   367                   {\ifodd\value{lstnumber}\color{capri!3}\fi}] | 
         | 
   368 def search(q: A, s: List[C]) : Boolean = s match { | 
         | 
   369   case Nil => fins(q)  | 
         | 
   370   case c::cs =>  | 
         | 
   371     delta.exists((d) => Try(search(d(q, c), cs)) getOrElse false)  | 
         | 
   372 }  | 
         | 
   373   | 
         | 
   374 def accepts(s: List[C]) : Boolean =   | 
         | 
   375   starts.exists(search(_, s))  | 
         | 
   376 \end{lstlisting}} | 
         | 
   377   | 
         | 
   378 \noindent  | 
         | 
   379 This depth-first way of exploration seems to work efficiently in many  | 
         | 
   380 examples and is much less of strain on memory. The problem is that the  | 
         | 
   381 backtracking can get catastrophic in some examples---remember the  | 
         | 
   382 catastrophic backtracking from earlier lectures. This depth-first  | 
         | 
   383 search with backtracking is the reason for the abysmal performance of  | 
         | 
   384 regular expression macthing in Java, Ruby and Python.  | 
   278   | 
   385   | 
   279 %This means if  | 
   386 %This means if  | 
   280 %we need to decide whether a string is accepted by a NFA, we might have  | 
   387 %we need to decide whether a string is accepted by a NFA, we might have  | 
   281 %to explore all possibilities. Also there is the special silent  | 
   388 %to explore all possibilities. Also there is the special silent  | 
   282 %transition in NFAs. As mentioned already this transition means you do  | 
   389 %transition in NFAs. As mentioned already this transition means you do  |