are easier to reason about and the notion of derivatives, although
already quite old, only became more widely known rather
recently. Still, let us in this lecture have a closer look at automata
and their relation to regular expressions. This will help us with
understanding why the regular expression matchers in Python, Ruby and
Java are so slow with certain regular expressions. On the way we will
also see what the limitations of regular expressions are.


\subsection*{Deterministic Finite Automata}

Let's start\ldots the central definition is:\medskip

\noindent 
A \emph{deterministic finite automaton} (DFA), say $A$, is
given by a five-tuple written ${\cal A}(\varSigma, Qs, Q_0, F, \delta)$ where

\end{center}

\noindent In this graphical notation, the accepting state $Q_4$ is
indicated with double circles. Note that there can be more than one
accepting state. It is also possible that a DFA has no accepting
state at all, or that the starting state is also an accepting
state. In the case above the transition function is defined everywhere
and can also be given as a table as follows:

\[
\begin{array}{lcl}
\widehat{\delta}(q, c\!::\!s) & \dn & \widehat{\delta}(\delta(q, c), s)\\
\end{array}
\]

\noindent This lifted transition function is often called
\emph{delta-hat}. Given a string, we start in the starting state and
take the first character of the string, follow the transition to the
next state, then take the second character and so on. Once the string
is exhausted and we end up in an accepting state, then this string is
accepted by the automaton. Otherwise it is not accepted. This also
means that if along the way we hit the case where the transition
function $\delta$ is not defined, we need to raise an error. In our
implementation we will deal with this case elegantly by using Scala's
\texttt{Try}.  Summing up: a string $s$ is in the \emph{language
  accepted by the automaton} ${\cal A}(\varSigma, Qs, Q_0, F, \delta)$
iff

\[
\widehat{\delta}(Q_0, s) \in F 
\]
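
\noindent For example, for a two-character string, say
$a\!::\!b\!::\![]$, this test unfolds as follows (using the clause for
$\widehat{\delta}$ shown above together with the obvious base case
$\widehat{\delta}(q, []) \dn q$):

\[
\widehat{\delta}(Q_0, a\!::\!b\!::\![]) \;=\;
\widehat{\delta}(\delta(Q_0, a), b\!::\![]) \;=\;
\widehat{\delta}(\delta(\delta(Q_0, a), b), []) \;=\;
\delta(\delta(Q_0, a), b)
\]

\noindent That is, the string is accepted iff the state
$\delta(\delta(Q_0, a), b)$ is one of the accepting states in $F$.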
\lstinputlisting[numbers=left,linebackgroundcolor=
                  {\ifodd\value{lstnumber}\color{capri!3}\fi}]
                  {../progs/dfa.scala}
\caption{A Scala implementation of DFAs using partial functions.
  Notice some subtleties: \texttt{deltas} implements the delta-hat
  construction by lifting the transition (partial) function to lists
  of characters. Since \texttt{delta} is given as a partial function,
  it can obviously go ``wrong'' in which case the \texttt{Try} in
  \texttt{accepts} catches the error and returns \texttt{false}---that
  means the string is not accepted.  The example \texttt{delta} in
  Lines 28--38 implements the DFA example shown earlier in the
  handout.\label{dfa}}
\end{figure}

My take on an implementation of DFAs in Scala is given in
Figure~\ref{dfa}. As you can see, there are many features of the
mathematical definition that are quite closely reflected in the
code. In the DFA-class, there is a starting state, called
\texttt{start}, with the polymorphic type \texttt{A}.  There is a
partial function \texttt{delta} for specifying the transitions---these
partial functions take a state (of polymorphic type \texttt{A}) and an
input (of polymorphic type \texttt{C}) and produce a new state (of
type \texttt{A}). For the moment it is OK to assume that \texttt{A} is
some arbitrary type for states and the input is just characters.  (The
reason for not having concrete types, but polymorphic types for the
states and the input of DFAs will become clearer later on.)

The DFA-class also has an argument for specifying final states. In the
implementation it is not a set of states, as in the mathematical
definition, but a function from states to booleans (this function is
supposed to return true whenever a state is final; false
otherwise). While this boolean function is different from a set of
states, Scala allows us to use sets for such functions (see Line 40
where the DFA is initialised). Again it will become clear later on why
I use functions for final states, rather than sets.
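
\noindent To make this description more concrete, here is a minimal
sketch of what such a DFA-class could look like. This is \emph{not}
the code of Figure~\ref{dfa}---names and details are only meant to
illustrate the ingredients just described:

{\small\begin{lstlisting}[language=Scala,linebackgroundcolor=
                  {\ifodd\value{lstnumber}\color{capri!3}\fi}]
import scala.util.Try

// a sketch of a DFA class: a start state, a partial transition
// function and final states given as a boolean function
case class DFA[A, C](start: A,
                     delta: PartialFunction[(A, C), A],
                     fins: A => Boolean) {

  // delta-hat: lift delta from single characters to lists of characters
  def deltas(q: A, s: List[C]): A = s match {
    case Nil => q
    case c :: cs => deltas(delta((q, c)), cs)
  }

  // if delta is undefined somewhere along the way, Try catches the
  // resulting exception and the string counts as not accepted
  def accepts(s: List[C]): Boolean =
    Try(fins(deltas(start, s))) getOrElse false
}
\end{lstlisting}}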
       
The most important point in the implementation is that I use Scala's
partial functions for representing the transitions; alternatives would
have been \texttt{Maps} or even \texttt{Lists}. One of the main
advantages of using partial functions is that transitions can be quite
nicely defined by a series of \texttt{case} statements (see Lines 28
-- 38 for an example). If you need to represent an automaton with a
    case (S1, 'a') => S2
    case _ => Sink
  }
\end{lstlisting}}  

\noindent I let you think what the corresponding DFA looks like in the
graphical notation.
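
\noindent Such a DFA could then be created and run along the following
lines. This is a made-up example using the DFA-class sketched earlier
(it is neither the automaton from the fragment above nor the one from
Figure~\ref{dfa}); its point is merely that a \texttt{Set} can be
supplied where a function from states to booleans is expected:

{\small\begin{lstlisting}[language=Scala,linebackgroundcolor=
                  {\ifodd\value{lstnumber}\color{capri!3}\fi}]
// a made-up three-state automaton with a sink state
abstract class MyState
case object M0 extends MyState
case object M1 extends MyState
case object M2 extends MyState
case object MSink extends MyState

val mdelta: PartialFunction[(MyState, Char), MyState] =
  { case (M0, 'a') => M1
    case (M1, 'a') => M2
    case _         => MSink }

// a Set can be used where a function MyState => Boolean is expected,
// because Scala sets are such functions
val mdfa = DFA(M0, mdelta, Set[MyState](M2))

mdfa.accepts("aa".toList)    // true:  M0 -a-> M1 -a-> M2
mdfa.accepts("aab".toList)   // false: ends in the sink state
\end{lstlisting}}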

Finally, I let you ponder whether this is a good implementation of
DFAs or not. In doing so I hope you notice that the $\varSigma$ and
$Qs$ components (the alphabet and the finite set of states,
respectively) are missing from the class definition. This means that
the implementation allows you to do some fishy things you are not
meant to do with DFAs. Which fishy things could these be?


\subsection*{Non-Deterministic Finite Automata}

Remember we want to find out what the relation is between regular
expressions and automata. To do this with DFAs is a bit unwieldy.
While with DFAs it is always clear, given a state and a character,
what the next state is (potentially none), it will be convenient to
relax this restriction. That means we allow states to have several
potential successor states. We even allow more than one starting
state. The resulting construction is called a \emph{Non-Deterministic
  Finite Automaton} (NFA), given also as a five-tuple ${\cal
  A}(\varSigma, Qs, Q_{0s}, F, \rho)$ where

\begin{itemize}
\item $\varSigma$ is an alphabet,  
\item $Qs$ is a finite set of states
\item $Q_{0s}$ is a set of start states ($Q_{0s} \subseteq Qs$)
this choice is that when it comes to deciding whether a string is
accepted by an NFA we potentially have to explore all possibilities. I
let you think which kinds of strings the above NFA accepts.


There are a number of additional points you should note about
NFAs. Every DFA is an NFA, but not vice versa. The $\rho$ in NFAs is a
transition \emph{relation} (DFAs have a transition function). The
difference between a function and a relation is that a function
always has a single output, while a relation gives, roughly speaking,
several outputs. Look again at the NFA above: if you are currently in
not determined.  This non-determinism can be represented by a
relation.
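
\noindent To illustrate the difference, such a relation can be
recorded, for example, as a set of triples of the form (state,
character, successor state). The states and transitions below are made
up for illustration and are not the NFA from the picture above:

{\small\begin{lstlisting}[language=Scala,linebackgroundcolor=
                  {\ifodd\value{lstnumber}\color{capri!3}\fi}]
// a transition relation as a set of triples: from state 1, reading an
// 'a' can lead to state 1 *or* to state 2---two "outputs" for one input
val rho = Set((1, 'a', 1), (1, 'a', 2), (2, 'b', 2))
\end{lstlisting}}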

My implementation of NFAs in Scala is shown in Figure~\ref{nfa}.
Perhaps interestingly, I do not actually use relations for my NFAs,
but use transition functions that return sets of states.  DFAs have
partial transition functions that return a single state; my NFAs
return a set. I let you think about this representation for
NFA-transitions and how it corresponds to the relations used in the
mathematical definition of NFAs.

Like in the mathematical definition, \texttt{starts} is in NFAs a set
of states; \texttt{fins} is again a function from states to
booleans. The \texttt{next} function calculates the set of next states
reachable from a single state \texttt{q} by a character~\texttt{c}. In
case there is no such state---the partial transition function is
undefined---the empty set is returned (see function
\texttt{applyOrElse} in Lines 9 and 10). The function \texttt{nexts}
just lifts this function to sets of states.
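
\noindent Again, to make this more concrete, here is a minimal sketch
of what such an NFA-class could look like. As before, this is
\emph{not} the code of Figure~\ref{nfa}; names and details are
assumptions made only for illustration:

{\small\begin{lstlisting}[language=Scala,linebackgroundcolor=
                  {\ifodd\value{lstnumber}\color{capri!3}\fi}]
// a sketch of an NFA class: a set of start states, a partial
// transition function returning *sets* of states, and final states
// given as a boolean function
case class NFA[A, C](starts: Set[A],
                     delta: PartialFunction[(A, C), Set[A]],
                     fins: A => Boolean) {

  // next states from a single state; the empty set if delta is undefined
  def next(q: A, c: C): Set[A] =
    delta.applyOrElse((q, c), (_: (A, C)) => Set[A]())

  // lifted to sets of states
  def nexts(qs: Set[A], c: C): Set[A] = qs.flatMap(next(_, c))

  // breadth-first: keep track of *all* currently reachable states
  def deltas(qs: Set[A], s: List[C]): Set[A] = s match {
    case Nil => qs
    case c :: cs => deltas(nexts(qs, c), cs)
  }

  def accepts(s: List[C]): Boolean = deltas(starts, s).exists(fins)
}
\end{lstlisting}}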
       

\begin{figure}[p]
\small
\lstinputlisting[numbers=left,linebackgroundcolor=
                  {\ifodd\value{lstnumber}\color{capri!3}\fi}]
                  {../progs/nfa.scala}
\caption{A Scala implementation of NFAs using partial functions.
  Notice that the function \texttt{accepts} implements the acceptance
  of a string in a breadth-first search fashion. This can be a costly
  way of deciding whether a string is accepted or not in applications
  that need to handle large NFAs and large inputs.\label{nfa}}
\end{figure}

Look very carefully at the \texttt{accepts} and \texttt{deltas}
functions in NFAs and remember that when accepting a string by an NFA
we might have to explore all possible transitions (recall which state
to go to is not unique anymore with NFAs\ldots{}we need to explore
potentially all next states). The implementation achieves this
exploration in a \emph{breadth-first search} manner. This is fine for
small NFAs, but can lead to real memory problems when the NFAs are
bigger and larger strings need to be processed. As a result, some
regular expression matching engines resort to a \emph{depth-first
  search} with \emph{backtracking} in unsuccessful cases. In our
implementation we can implement a depth-first version of
\texttt{accepts} using Scala's \texttt{exists}-function as follows:

{\small\begin{lstlisting}[language=Scala,linebackgroundcolor=
                  {\ifodd\value{lstnumber}\color{capri!3}\fi}]
def search(q: A, s: List[C]) : Boolean = s match {
  case Nil => fins(q)
  case c::cs => next(q, c).exists(search(_, cs)) 
}

def accepts2(s: List[C]) : Boolean = 
  starts.exists(search(_, s))
\end{lstlisting}}

\noindent
This depth-first way of exploration seems to work efficiently in many
examples and is much less of a strain on memory. The problem is that
the backtracking can get ``catastrophic'' in some examples---remember
the catastrophic backtracking from earlier lectures. This depth-first
search with backtracking is the reason for the abysmal performance of
some regular expression matchers in Java, Ruby and Python. I would
like to show you this next.
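
\noindent As a first taste of this phenomenon, the following snippet
measures Java's regular expression matcher (called here directly from
Scala) on an expression that is prone to catastrophic backtracking.
The concrete expression and string lengths are my own choice for
illustration and may well differ from the ones used in the lecture:

{\small\begin{lstlisting}[language=Scala,linebackgroundcolor=
                  {\ifodd\value{lstnumber}\color{capri!3}\fi}]
import java.util.regex.Pattern

// a regular expression that is prone to catastrophic backtracking
val evil = Pattern.compile("(a*)*b")

// measure the running time of a piece of code in seconds
def time(code: => Unit): Double = {
  val start = System.nanoTime()
  code
  (System.nanoTime() - start) / 1e9
}

// strings consisting only of a's can never match, but a backtracking
// matcher explores very many ways of splitting the a's before giving up
for (n <- 15 to 30 by 5) {
  val s = "a" * n
  println(n + ": " + time(evil.matcher(s).matches()) + " secs")
}
\end{lstlisting}}
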
\subsubsection*{Thompson Construction}

In order to get an idea what calculations are performed by Java \&
friends, we need a method for transforming a regular expression into
an automaton. This automaton should accept exactly those strings that
are accepted by the regular expression.  The simplest and most
well-known method for this is called \emph{Thompson Construction},
after the Turing Award winner Ken Thompson. This method works by
recursion over regular expressions and uses the non-determinism in
NFAs. You will see shortly why this construction works well with NFAs,
but is not so straightforward with DFAs. Unfortunately we are still
one step away from our intended target though---because this
construction uses a version of NFAs that allows ``silent
transitions''. The idea behind silent transitions is that they
allow us to go from one state to the next without having to consume a
character. We label such silent transitions with the letter
$\epsilon$ and call the automata $\epsilon$NFAs. Two
typical examples of $\epsilon$NFAs are


\begin{center}
\begin{tabular}[t]{c@{\hspace{9mm}}c}
\begin{tikzpicture}[>=stealth',very thick,
    every state/.style={minimum size=0pt,draw=blue!50,very thick,fill=blue!20},]
\node[state,initial]  (Q_0)  {$Q_0$};
\node[state] (Q_1) [above=of Q_0] {$Q_1$};
\node[state, accepting] (Q_2) [below=of Q_0] {$Q_2$};
\path[->] (Q_0) edge node [left]  {$\epsilon$} (Q_1);
\path[->] (Q_0) edge node [left]  {$\epsilon$} (Q_2);
\path[->] (Q_0) edge [loop right] node  {$a$} ();
\path[->] (Q_1) edge [loop right] node  {$a$} ();
\path[->] (Q_2) edge [loop right] node  {$b$} ();
\end{tikzpicture} &

\raisebox{20mm}{
\begin{tikzpicture}[>=stealth',very thick,
    every state/.style={minimum size=0pt,draw=blue!50,very thick,fill=blue!20},]
\node[state,initial]  (r_1)  {$R_1$};
\node[state] (r_2) [above=of r_1] {$R_2$};
\node[state, accepting] (r_3) [right=of r_1] {$R_3$};
\path[->] (r_1) edge node [below]  {$b$} (r_3);
\path[->] (r_2) edge [bend left] node [above]  {$a$} (r_3);
\path[->] (r_1) edge [bend left] node  [left] {$\epsilon$} (r_2);
\path[->] (r_2) edge [bend left] node  [right] {$a$} (r_1);
\end{tikzpicture}}
\end{tabular}
\end{center}

\noindent
Consider the $\epsilon$NFA on the left-hand side: an
$\epsilon$-transition means you do not have to ``consume'' any part of
the input string, but ``silently'' change to a different state. For
example, if you are in the starting state $Q_0$, you can silently move
either to $Q_1$ or $Q_2$. In this example you can see that once you
are in $Q_1$, respectively $Q_2$, you cannot ``go back'' to the other
states.

On first appearances, $\epsilon$-transitions might look rather
strange, or even silly. To start with, silent transitions make the
decision whether a string is accepted by an automaton even harder:
with $\epsilon$NFAs we have to check whether we can first do some
$\epsilon$-transitions and then do a ``proper''-transition; and after
any ``proper''-transition we again have to check whether we can do
some silent transitions. Even worse, if there is a silent
transition pointing back to the same state, then we have to be careful
that our decision procedure for strings does not loop (remember the
depth-first search for exploring all states).

The obvious question is: Do we get anything in return for this hassle
with silent transitions?  Well, we still have to work for it\ldots
unfortunately.  If we were to follow the many textbooks on the
subject, we would now start with defining what $\epsilon$NFAs
are---that would require extending the transition relation of
NFAs. Next, we would show that $\epsilon$NFAs are equivalent to NFAs
and so on. Once we have done all this on paper, we would need to
implement $\epsilon$NFAs. Let's try to take a shortcut---we are not
really interested in $\epsilon$NFAs; they are only a convenient tool
for translating regular expressions into automata. We are not going to
implement them explicitly, but translate them directly into NFAs
(in a sense $\epsilon$NFAs are just a convenient API for lazy people).
How does the translation work? Well, we have to find all transitions of
the form

\[
q\stackrel{\epsilon}{\longrightarrow}\ldots\stackrel{\epsilon}{\longrightarrow}
\;\stackrel{a}{\longrightarrow}\;
\stackrel{\epsilon}{\longrightarrow}\ldots\stackrel{\epsilon}{\longrightarrow} q'
\]

\noindent and replace them with $q \stackrel{a}{\longrightarrow}
q'$. Doing this to the $\epsilon$NFA on the right-hand side above gives
the NFA

\begin{center}
\begin{tikzpicture}[>=stealth',very thick,
    every state/.style={minimum size=0pt,draw=blue!50,very thick,fill=blue!20},]
\node[state,initial]  (r_1)  {$R_1$};
\node[state] (r_2) [above=of r_1] {$R_2$};
\node[state, accepting] (r_3) [right=of r_1] {$R_3$};
\path[->] (r_1) edge node [above]  {$b$} (r_3);
\path[->] (r_2) edge [bend left] node [above]  {$a$} (r_3);
\path[->] (r_1) edge [bend left] node  [left] {$a$} (r_2);
\path[->] (r_2) edge [bend left] node  [right] {$a$} (r_1);
\path[->] (r_1) edge [loop below] node  {$a$} ();
\path[->] (r_1) edge [bend right] node [below]  {$a$} (r_3);
\end{tikzpicture}
\end{center}

\noindent where the single $\epsilon$-transition is replaced by
three more $a$-transitions. So whenever we are given an $\epsilon$NFA
we will, similarly, replace it by an equivalent NFA. The code for this
is given in Figure~\ref{enfa}. At the core is a function that
calculates a fixpoint of a function (Lines 5--10). This is used for
``discovering'' states reachable by $\epsilon$-transitions. Once no
new state is reachable, a fixpoint is reached. This is used, for
example, when calculating the starting states of an equivalent
NFA---see Line 36. We start with all starting states of the
$\epsilon$NFA and then look for all other states that can be reached
by $\epsilon$-transitions. This is what the $\epsilon$-closure, called
\texttt{ecl}, calculates.
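
\noindent The following is a small sketch of this fixpoint idea. The
names \texttt{fixpT} and \texttt{ecl} and the representation of the
$\epsilon$-moves as a function are assumptions for illustration only;
the actual code in Figure~\ref{enfa} is organised differently:

{\small\begin{lstlisting}[language=Scala,linebackgroundcolor=
                  {\ifodd\value{lstnumber}\color{capri!3}\fi}]
import scala.annotation.tailrec

// iterate a function until the value does not change any more
@tailrec
def fixpT[A](f: A => A, x: A): A = {
  val fx = f(x)
  if (fx == x) x else fixpT(f, fx)
}

// epsilon-closure: all states reachable from qs by zero or more
// silent transitions, where enext gives the one-step epsilon-successors
def ecl[A](enext: A => Set[A], qs: Set[A]): Set[A] =
  fixpT((s: Set[A]) => s | s.flatMap(enext), qs)
\end{lstlisting}}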
       

\begin{figure}[p]
\small
\lstinputlisting[numbers=left,linebackgroundcolor=
                  {\ifodd\value{lstnumber}\color{capri!3}\fi}]
                  {../progs/enfa.scala}

\caption{A Scala function that translates $\epsilon$NFAs into NFAs. The
  transitions of $\epsilon$NFAs take as input an \texttt{Option[C]}.
  \texttt{None} stands for an $\epsilon$-transition; \texttt{Some(c)}
  for a ``proper'' transition. The functions in Lines 18--26 calculate
  all states reachable by one or more $\epsilon$-transitions for a given
  set of states. The NFA is constructed in Lines 36--38.\label{enfa}}
\end{figure}
       

Now having a translation of $\epsilon$NFAs to NFAs in place, we can
finally make headway with the problem of translating regular
expressions into equivalent NFAs. By equivalent we mean that the NFAs
recognise the same language. Consider the simple regular expressions
$\ZERO$, $\ONE$ and $c$. They can be translated into equivalent NFAs
as follows:

\begin{center}
\begin{tabular}[t]{l@{\hspace{10mm}}l}
\raisebox{1mm}{$\ZERO$} & 
\begin{tikzpicture}[scale=0.7,>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
\node[state, accepting]  (Q_1)  [right=of Q_0] {$\mbox{}$};
\path[->] (Q_0) edge node [below]  {$c$} (Q_1);
\end{tikzpicture}\\\\
\end{tabular}
\end{center}
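
\noindent In code, these base cases are small enough that they can be
sketched directly with the hypothetical NFA-class from above (without
any $\epsilon$-transitions yet); this is only an illustration and not
the code of Figure~\ref{thompson}:

{\small\begin{lstlisting}[language=Scala,linebackgroundcolor=
                  {\ifodd\value{lstnumber}\color{capri!3}\fi}]
// states are plain integers in this sketch
def zeroNFA: NFA[Int, Char] =                      // accepts nothing
  NFA(Set(0), PartialFunction.empty, Set[Int]())

def oneNFA: NFA[Int, Char] =                       // accepts only ""
  NFA(Set(0), PartialFunction.empty, Set(0))

def charNFA(c: Char): NFA[Int, Char] =             // accepts only "c"
  NFA(Set(0), { case (0, d) if d == c => Set(1) }, Set(1))
\end{lstlisting}}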
       

\begin{figure}[p]
\small
\lstinputlisting[numbers=left,linebackgroundcolor=
                  {\ifodd\value{lstnumber}\color{capri!3}\fi}]
                  {../progs/thompson.scala}
\caption{A Scala implementation of the Thompson construction, which
  translates regular expressions into NFAs.\label{thompson}}
\end{figure}


\noindent The case for the sequence regular expression $r_1
\cdot r_2$ is as follows: We are given by recursion two
automata representing $r_1$ and $r_2$ respectively. 

\subsubsection*{Subset Construction}

What is interesting is that for every NFA we can find a DFA
which recognises the same language. This can, for example, be
done by the \emph{subset construction}.
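
\noindent The idea behind the subset construction is that the states
of the DFA are \emph{sets} of NFA states. Reusing the two
(hypothetical) classes sketched earlier, the idea can be expressed
roughly as follows; note that this sketch does not explicitly
enumerate the finitely many reachable sets of states:

{\small\begin{lstlisting}[language=Scala,linebackgroundcolor=
                  {\ifodd\value{lstnumber}\color{capri!3}\fi}]
// a rough sketch of the subset construction: DFA states are sets of
// NFA states
def subset[A, C](nfa: NFA[A, C]): DFA[Set[A], C] =
  DFA(nfa.starts,                            // start: the set of NFA start states
      { case (qs, c) => nfa.nexts(qs, c) },  // transition on sets of states
      (qs: Set[A]) => qs.exists(nfa.fins))   // final iff it contains a final state
\end{lstlisting}}
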
\noindent Consider again the NFA
below on the left, consisting of nodes labelled $0$, $1$ and $2$. 

\begin{center}
\begin{tabular}{c@{\hspace{10mm}}c}
\begin{tikzpicture}[scale=0.7,>=stealth',very thick,
                    every state/.style={minimum size=0pt,
state followed by character---in this case $Q_0\,a$. Now let's
have a look at the third equation: there are two incoming
edges for $Q_2$. Therefore we have two terms, namely $Q_1\,a$ and
$Q_2\,a$. These terms are separated by $+$. The first states
that being in state $Q_1$ and consuming an $a$ will bring you to
$Q_2$, and the second that being in $Q_2$ and consuming an $a$
will make you stay in $Q_2$. The right-hand side of the
first equation is constructed similarly: there are three 
incoming edges, therefore there are three terms. There is
one exception in that we also ``add'' $\ONE$ to the
first equation, because it corresponds to the starting state
\begin{eqnarray}
Q_0 & = & \ONE + Q_0\,b + Q_0\,a\,b +  Q_2\,b\\
Q_2 & = & Q_0\,a\,a + Q_2\,a
\end{eqnarray}

\noindent where in Equation (4) we have two occurrences
of $Q_0$. Using the laws about $+$ and $\cdot$, we can simplify 
Equation (4) to obtain the following two equations:

\begin{eqnarray}
Q_0 & = & \ONE + Q_0\,(b + a\,b) +  Q_2\,b\\
%%% mode: latex
%%% TeX-master: t
%%% End: 