% afl-material: handouts/ho03.tex@3506b1718c08

\documentclass{article}
\usepackage{../style}
\usepackage{../langs}
\usepackage{../graphics}


\begin{document}
\fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017}

\section*{Handout 3 (Finite Automata)}

Every formal language and compiler course I know of bombards you first
with automata and then to a much, much smaller extent with regular
expressions. As you can see, this course is turned upside down:
regular expressions come first. The reason is that regular expressions
are easier to reason about and the notion of derivatives, although
already quite old, only became more widely known rather
recently. Still, let us in this lecture have a closer look at automata
and their relation to regular expressions. This will help us with
understanding why the regular expression matchers in Python, Ruby and
Java are so slow with certain regular expressions. On the way we will
also see what the limitations of regular expressions are.
Unfortunately, they cannot be used for \emph{everything}.


\subsection*{Deterministic Finite Automata}

Let's start\ldots the central definition is:\medskip

\noindent
A \emph{deterministic finite automaton} (DFA), say $A$, is
given by a five-tuple written ${\cal A}(\varSigma, Qs, Q_0, F, \delta)$ where

\begin{itemize}
\item $\varSigma$ is an alphabet,
\item $Qs$ is a finite set of states,
\item $Q_0 \in Qs$ is the start state,
\item $F \subseteq Qs$ are the accepting states, and
\item $\delta$ is the transition function.
\end{itemize}

\noindent I am sure you have seen this definition
before. The transition function determines how to ``transition'' from
one state to the next state with respect to a character. We
assume that these transition functions do not need to be defined
everywhere: it can be the case that given a character there is no
next state, in which case we need to raise a kind of ``failure
exception''. That means we actually have \emph{partial} functions as
transitions (see the Scala implementation of DFAs later on). A
typical example of a DFA is

\begin{center}
\begin{tikzpicture}[>=stealth',very thick,auto,
every state/.style={minimum size=0pt,
inner sep=2pt,draw=blue!50,very thick,
fill=blue!20},scale=2]
\node[state,initial] (Q_0) {$Q_0$};
\node[state] (Q_1) [right=of Q_0] {$Q_1$};
\node[state] (Q_2) [below right=of Q_0] {$Q_2$};
\node[state] (Q_3) [right=of Q_2] {$Q_3$};
\node[state, accepting] (Q_4) [right=of Q_1] {$Q_4$};
\path[->] (Q_0) edge node [above] {$a$} (Q_1);
\path[->] (Q_1) edge node [above] {$a$} (Q_4);
\path[->] (Q_4) edge [loop right] node {$a, b$} ();
\path[->] (Q_3) edge node [right] {$a$} (Q_4);
\path[->] (Q_2) edge node [above] {$a$} (Q_3);
\path[->] (Q_1) edge node [right] {$b$} (Q_2);
\path[->] (Q_0) edge node [above] {$b$} (Q_2);
\path[->] (Q_2) edge [loop left] node {$b$} ();
\path[->] (Q_3) edge [bend left=95, looseness=1.3] node [below] {$b$} (Q_0);
\end{tikzpicture}
\end{center}

\noindent In this graphical notation, the accepting state $Q_4$ is
indicated with double circles. Note that there can be more than one
accepting state. It is also possible that a DFA has no accepting
state at all, or that the starting state is also an accepting
state. In the case above the transition function is defined everywhere
and can also be given as a table as follows:

\[
\begin{array}{lcl}
(Q_0, a) &\rightarrow& Q_1\\
(Q_0, b) &\rightarrow& Q_2\\
(Q_1, a) &\rightarrow& Q_4\\
(Q_1, b) &\rightarrow& Q_2\\
(Q_2, a) &\rightarrow& Q_3\\
(Q_2, b) &\rightarrow& Q_2\\
(Q_3, a) &\rightarrow& Q_4\\
(Q_3, b) &\rightarrow& Q_0\\
(Q_4, a) &\rightarrow& Q_4\\
(Q_4, b) &\rightarrow& Q_4\\
\end{array}
\]
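
\noindent Read as a five-tuple, this example DFA has the following
components. The alphabet is not stated explicitly in the picture; taking
it to be $\{a, b\}$ is an assumption read off the edge labels.

\[
\varSigma = \{a, b\}, \qquad
Qs = \{Q_0, Q_1, Q_2, Q_3, Q_4\}, \qquad
\mbox{start state } Q_0, \qquad
F = \{Q_4\}
\]

\noindent The remaining component $\delta$ is the function given by the
table.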

We need to define the notion of what language is accepted by
an automaton. For this we lift the transition function
$\delta$ from characters to strings as follows:

\[
\begin{array}{lcl}
\widehat{\delta}(q, []) & \dn & q\\
\widehat{\delta}(q, c\!::\!s) & \dn & \widehat{\delta}(\delta(q, c), s)\\
\end{array}
\]

\noindent This lifted transition function is often called
\emph{delta-hat}. Given a string, we start in the starting state and
take the first character of the string, follow to the next state, then
take the second character and so on. If, once the string is exhausted,
we end up in an accepting state, then this string is accepted by the
automaton. Otherwise it is not accepted. This also means that if along
the way we hit the case where the transition function $\delta$ is not
defined, we need to raise an error. In our implementation we will deal
with this case elegantly by using Scala's \texttt{Try}. Summing up: a
string $s$ is in the \emph{language accepted by the automaton} ${\cal
A}(\varSigma, Qs, Q_0, F, \delta)$ iff

\[
\widehat{\delta}(Q_0, s) \in F
\]

\noindent I leave it to you to think about a definition that describes the set of
all strings accepted by a deterministic finite automaton.
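
As a small worked example, take the DFA shown earlier and the string
$abaa$ (chosen here purely for illustration). Unfolding the two clauses
of delta-hat and looking up each step in the transition table gives

\[
\begin{array}{lcl}
\widehat{\delta}(Q_0, abaa) & = & \widehat{\delta}(\delta(Q_0, a), baa) \;=\; \widehat{\delta}(Q_1, baa)\\
 & = & \widehat{\delta}(Q_2, aa)\\
 & = & \widehat{\delta}(Q_3, a)\\
 & = & \widehat{\delta}(Q_4, [])\\
 & = & Q_4
\end{array}
\]

\noindent Since $Q_4 \in F$, the string $abaa$ is accepted. For the
string $ab$ the same calculation ends in $Q_2$, which is not an accepting
state, so $ab$ is not accepted.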

\begin{figure}[p]
\small
\lstinputlisting[numbers=left]{../progs/display/dfa.scala}
\caption{A Scala implementation of DFAs using partial functions.
Note some subtleties: \texttt{deltas} implements the delta-hat
construction by lifting the (partial) transition function to lists
of characters. Since \texttt{delta} is given as a partial function,
it can obviously go ``wrong'' in which case the \texttt{Try} in
\texttt{accepts} catches the error and returns \texttt{false}, which
means the string is not accepted. The example \texttt{delta} in
Lines 28--38 implements the DFA example shown earlier in the
handout.\label{dfa}}
\end{figure}

My take on an implementation of DFAs in Scala is given in
Figure~\ref{dfa}. As you can see, there are many features of the
mathematical definition that are quite closely reflected in the
code. In the DFA-class, there is a starting state, called
\texttt{start}, with the polymorphic type \texttt{A}. There is a
partial function \texttt{delta} for specifying the transitions; these
partial functions take a state (of polymorphic type \texttt{A}) and an
input (of polymorphic type \texttt{C}) and produce a new state (of
type \texttt{A}). For the moment it is OK to assume that \texttt{A} is
some arbitrary type for states and the input is just characters. (The
reason for not having concrete types, but polymorphic types for the
states and the input of DFAs will become clearer later on.)

The DFA-class also has an argument for specifying final states. In the
implementation it is not a set of states, as in the mathematical
definition, but a function from states to booleans (this function is
supposed to return true whenever a state is final; false
otherwise). While this boolean function is different from the sets of
states, Scala allows sets to be used as such functions (see Line 40 where
the DFA is initialised). Again it will become clear later on why I use
functions for final states, rather than sets.
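
The listing in Figure~\ref{dfa} is the reference implementation. To make
the description above concrete, here is a minimal sketch of a class with
that shape, written under some assumptions: the names \texttt{start},
\texttt{delta}, \texttt{deltas} and \texttt{accepts} are taken from the
text, whereas the name \texttt{fins}, the definition of the abbreviation
\texttt{:=>}, the wrapper object \texttt{DfaSketch} and the state objects
\texttt{Q0} to \texttt{Q4} are chosen here for illustration and need not
match the file shown in the figure.

{\small\begin{lstlisting}[language=Scala]
import scala.util.Try

object DfaSketch {
  // assumed abbreviation for partial functions
  type :=>[A, B] = PartialFunction[A, B]

  case class DFA[A, C](start: A,              // starting state
                       delta: (A, C) :=> A,   // partial transition function
                       fins: A => Boolean) {  // test for final states

    // delta-hat: lift delta from single inputs to lists of inputs
    def deltas(q: A, s: List[C]): A = s match {
      case Nil => q
      case c :: cs => deltas(delta((q, c)), cs)
    }

    // an undefined transition throws a MatchError,
    // which Try turns into the answer false
    def accepts(s: List[C]): Boolean =
      Try(fins(deltas(start, s))).getOrElse(false)
  }

  // states of the example DFA shown earlier (names assumed)
  abstract class State
  case object Q0 extends State
  case object Q1 extends State
  case object Q2 extends State
  case object Q3 extends State
  case object Q4 extends State

  // the transition table of the example DFA
  val delta: (State, Char) :=> State = {
    case (Q0, 'a') => Q1
    case (Q0, 'b') => Q2
    case (Q1, 'a') => Q4
    case (Q1, 'b') => Q2
    case (Q2, 'a') => Q3
    case (Q2, 'b') => Q2
    case (Q3, 'a') => Q4
    case (Q3, 'b') => Q0
    case (Q4, 'a') => Q4
    case (Q4, 'b') => Q4
  }

  // a Set[State] is used here as the function from states to booleans
  val dfa = DFA[State, Char](Q0, delta, Set[State](Q4))
}
\end{lstlisting}}

\noindent With this sketch, \texttt{DfaSketch.dfa.accepts("abaa".toList)}
evaluates to \texttt{true}, while an input containing a character outside
$\{a, b\}$, such as \texttt{"ac".toList}, yields \texttt{false} because
the undefined transition is caught by \texttt{Try}.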

The most important point in the implementation is that I use Scala's
partial functions for representing the transitions; alternatives would
have been \texttt{Maps} or even \texttt{Lists}. One of the main
advantages of using partial functions is that transitions can be quite
nicely defined by a series of \texttt{case} statements (see Lines
28--38 for an example). If you need to represent an automaton with a
sink state (catch-all-state), you can use Scala's pattern matching and
write something like

{\small\begin{lstlisting}[language=Scala]
abstract class State
...
case object Sink extends State

val delta : (State, Char) :=> State =
  { case (S0, 'a') => S1
    case (S1, 'a') => S2
    case _ => Sink
  }
\end{lstlisting}}

\noindent I leave it to you to work out what the corresponding DFA looks like in the
graph notation.
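
To see what the catch-all case changes, here is a small self-contained
sketch. It rests on assumptions: the state objects \texttt{S0},
\texttt{S1} and \texttt{S2} stand in for whatever the elided part of the
listing declares, the abbreviation \texttt{:=>} is defined as a plain
alias for \texttt{PartialFunction}, and \texttt{deltaNoSink} and the
wrapper object \texttt{SinkSketch} are made up purely for comparison.

{\small\begin{lstlisting}[language=Scala]
object SinkSketch {
  // assumed abbreviation for partial functions
  type :=>[A, B] = PartialFunction[A, B]

  abstract class State
  case object S0 extends State    // hypothetical states
  case object S1 extends State
  case object S2 extends State
  case object Sink extends State

  // with the catch-all case: defined for every state and character
  val delta: (State, Char) :=> State = {
    case (S0, 'a') => S1
    case (S1, 'a') => S2
    case _ => Sink
  }

  // without the catch-all case: defined for two inputs only
  val deltaNoSink: (State, Char) :=> State = {
    case (S0, 'a') => S1
    case (S1, 'a') => S2
  }
}
\end{lstlisting}}

\noindent Here \texttt{delta.isDefinedAt((S0, 'b'))} is \texttt{true} and
\texttt{delta((S0, 'b'))} gives \texttt{Sink}, whereas
\texttt{deltaNoSink.isDefinedAt((S0, 'b'))} is \texttt{false}. In other
words, the catch-all case turns the partial function into one that is
defined everywhere.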

Finally, I let you ponder whether this is a good implementation of
DFAs or not. In doing so I hope you notice that the $\varSigma$ and
$Qs$ components (the alphabet and the finite set of states,
respectively) are missing from the class definition. This means that
the implementation allows you to do some fishy things you are not
meant to do with DFAs. Which fishy things could those be?



\subsection*{Non-Deterministic Finite Automata}

Remember we want to find out what the relation is between regular
expressions and automata. To do this with DFAs is a bit unwieldy.
While with DFAs it is always clear, given a state and a character,
what the next state is (potentially none), it will be convenient to
relax this restriction. That means we allow states to have several
potential successor states. We even allow more than one starting
state. The resulting construction is called a \emph{Non-Deterministic
Finite Automaton} (NFA), given also as a five-tuple ${\cal
A}(\varSigma, Qs, Q_{0s}, F, \rho)$ where

\begin{itemize}
\item $\varSigma$ is an alphabet,
\item $Qs$ is a finite set of states,
\item $Q_{0s}$ is a set of start states ($Q_{0s} \subseteq Qs$),
\item $F$ are some accepting states with $F \subseteq Qs$, and
\item $\rho$ is a transition relation.
\end{itemize}

\noindent
A typical example of an NFA is

% A NFA for (ab* + b)*a
\begin{center}
\begin{tikzpicture}[>=stealth',very thick, auto,
every state/.style={minimum size=0pt,inner sep=3pt,
draw=blue!50,very thick,fill=blue!20},scale=2]
\node[state,initial] (Q_0) {$Q_0$};
\node[state] (Q_1) [right=of Q_0] {$Q_1$};
\node[state, accepting] (Q_2) [right=of Q_1] {$Q_2$};
\path[->] (Q_0) edge [loop above] node {$b$} ();
\path[<-] (Q_0) edge node [below] {$b$} (Q_1);
\path[->] (Q_0) edge [bend left] node [above] {$a$} (Q_1);
\path[->] (Q_0) edge [bend right] node [below] {$a$} (Q_2);
\path[->] (Q_1) edge [loop above] node {$a,b$} ();
\path[->] (Q_1) edge node [above] {$a$} (Q_2);
\end{tikzpicture}
\end{center}

482 74149519e436 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	236	\noindent
74149519e436 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	237	This NFA happens to have only one starting state, but in general there
74149519e436 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	238	could be more. Notice that in state $Q_0$ we might go to state $Q_1$
483 faba5360372c updated Christian Urban <urbanc@in.tum.de> parents: 482 diff changeset	239	\emph{or} to state $Q_2$ when receiving an $a$. Similarly in state
484 8182eb3278e0 updated Christian Urban <urbanc@in.tum.de> parents: 483 diff changeset	240	$Q_1$ and receiving an $a$, we can stay in state $Q_1$ \emph{or} go to
8182eb3278e0 updated Christian Urban <urbanc@in.tum.de> parents: 483 diff changeset	241	$Q_2$. This kind of choice is not allowed with DFAs. The downside of
488 057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	242	this choice in NFAs is that when it comes to deciding whether a string is
484 8182eb3278e0 updated Christian Urban <urbanc@in.tum.de> parents: 483 diff changeset	243	accepted by an NFA we potentially have to explore all possibilities. I
488 057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	244	will let you think about which strings the above NFA accepts.
482 74149519e436 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	245
74149519e436 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	246
485 21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	247	There are a number of additional points you should note about
483 faba5360372c updated Christian Urban <urbanc@in.tum.de> parents: 482 diff changeset	248	NFAs. Every DFA is an NFA, but not vice versa. The $\rho$ in NFAs is a
faba5360372c updated Christian Urban <urbanc@in.tum.de> parents: 482 diff changeset	249	transition \emph{relation} (DFAs have a transition function). The
faba5360372c updated Christian Urban <urbanc@in.tum.de> parents: 482 diff changeset	250	difference between a function and a relation is that a function always
faba5360372c updated Christian Urban <urbanc@in.tum.de> parents: 482 diff changeset	251	has a single output, while a relation gives, roughly speaking,
faba5360372c updated Christian Urban <urbanc@in.tum.de> parents: 482 diff changeset	252	several outputs. Look again at the NFA above: if you are currently in
faba5360372c updated Christian Urban <urbanc@in.tum.de> parents: 482 diff changeset	253	the state $Q_1$ and you read a character $b$, then you can transition
faba5360372c updated Christian Urban <urbanc@in.tum.de> parents: 482 diff changeset	254	to $Q_0$ \emph{or} stay in $Q_1$. Which route, or output, you take is
faba5360372c updated Christian Urban <urbanc@in.tum.de> parents: 482 diff changeset	255	not determined. This non-determinism can be represented by a
faba5360372c updated Christian Urban <urbanc@in.tum.de> parents: 482 diff changeset	256	relation.
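For instance, if one reads $\rho$ as a set of triples consisting of a
state, a character and a next state (a notational assumption made only
for this illustration), then the transitions drawn in the NFA above
would be

\[
\begin{array}{r@{\;}l}
\rho = \{ & (Q_0, a, Q_1),\; (Q_0, a, Q_2),\; (Q_0, b, Q_0),\\
          & (Q_1, a, Q_1),\; (Q_1, a, Q_2),\; (Q_1, b, Q_0),\; (Q_1, b, Q_1)\;\}
\end{array}
\]

\noindent
The pairs $(Q_0, a)$, $(Q_1, a)$ and $(Q_1, b)$ each occur in more than
one triple, which is something a function from states and characters
to a single state could not express.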
482 74149519e436 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	257
483 faba5360372c updated Christian Urban <urbanc@in.tum.de> parents: 482 diff changeset	258	My implementation of NFAs in Scala is shown in Figure~\ref{nfa}.
faba5360372c updated Christian Urban <urbanc@in.tum.de> parents: 482 diff changeset	259	Perhaps interestingly, I do not actually use relations for my NFAs,
485 21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	260	but use transition functions that return sets of states. DFAs have
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	261	partial transition functions that return a single state; my NFAs
488 057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	262	return a set of states. I will let you think about this representation for
485 21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	263	NFA-transitions and how it corresponds to the relations used in the
487 ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	264	mathematical definition of NFAs. An example of a transition function
488 057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	265	in Scala for the NFA shown above is
482 74149519e436 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	266
490 8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	267	{\small\begin{lstlisting}[language=Scala]
487 ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	268	val nfa_delta : (State, Char) :=> Set[State] =
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	269	{ case (Q0, 'a') => Set(Q1, Q2)
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	270	case (Q0, 'b') => Set(Q0)
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	271	case (Q1, 'a') => Set(Q1, Q2)
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	272	case (Q1, 'b') => Set(Q0, Q1) }
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	273	\end{lstlisting}}
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	274
490 8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	275	As in the mathematical definition, \texttt{starts} is a set of states
487 ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	276	in NFAs; \texttt{fins} is again a function from states to
485 21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	277	booleans. The \texttt{next} function calculates the set of next states
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	278	reachable from a single state \texttt{q} by a character~\texttt{c}. In
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	279	case there is no such state---the partial transition function is
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	280	undefined---the empty set is returned (see function
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	281	\texttt{applyOrElse} in Lines 9 and 10). The function \texttt{nexts}
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	282	just lifts this function to sets of states.
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	283
484 8182eb3278e0 updated Christian Urban <urbanc@in.tum.de> parents: 483 diff changeset	284	\begin{figure}[p]
482 74149519e436 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	285	\small
490 8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	286	\lstinputlisting[numbers=left]{../progs/display/nfa.scala}
485 21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	287	\caption{A Scala implementation of NFAs using partial functions.
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	288	Notice that the function \texttt{accepts} implements the
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	289	acceptance of a string in a breadth-first search fashion. This can be a costly
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	290	way of deciding whether a string is accepted or not in applications that need to handle
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	291	large NFAs and large inputs.\label{nfa}}
482 74149519e436 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	292	\end{figure}
74149519e436 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	293
485 21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	294	Look very carefully at the \texttt{accepts} and \texttt{deltas}
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	295	functions in NFAs and remember that when accepting a string by an NFA
484 8182eb3278e0 updated Christian Urban <urbanc@in.tum.de> parents: 483 diff changeset	296	we might have to explore all possible transitions (recall which state
485 21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	297	to go to is not unique anymore with NFAs\ldots{}we need to explore
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	298	potentially all next states). The implementation achieves this
487 ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	299	exploration through a \emph{breadth-first search}. This is fine for
485 21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	300	small NFAs, but can lead to real memory problems when the NFAs are
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	301	bigger and larger strings need to be processed. As a result, some
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	302	regular expression matching engines resort to a \emph{depth-first
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	303	search} with \emph{backtracking} in unsuccessful cases. In our
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	304	implementation we can implement a depth-first version of
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	305	\texttt{accepts} using Scala's \texttt{exists}-function as follows:
483 faba5360372c updated Christian Urban <urbanc@in.tum.de> parents: 482 diff changeset	306
faba5360372c updated Christian Urban <urbanc@in.tum.de> parents: 482 diff changeset	307
490 8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	308	{\small\begin{lstlisting}[language=Scala]
483 faba5360372c updated Christian Urban <urbanc@in.tum.de> parents: 482 diff changeset	309	def search(q: A, s: List[C]) : Boolean = s match {
faba5360372c updated Christian Urban <urbanc@in.tum.de> parents: 482 diff changeset	310	case Nil => fins(q)
485 21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	311	case c::cs => next(q, c).exists(search(_, cs))
483 faba5360372c updated Christian Urban <urbanc@in.tum.de> parents: 482 diff changeset	312	}
faba5360372c updated Christian Urban <urbanc@in.tum.de> parents: 482 diff changeset	313
485 21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	314	def accepts2(s: List[C]) : Boolean =
483 faba5360372c updated Christian Urban <urbanc@in.tum.de> parents: 482 diff changeset	315	starts.exists(search(_, s))
faba5360372c updated Christian Urban <urbanc@in.tum.de> parents: 482 diff changeset	316	\end{lstlisting}}
faba5360372c updated Christian Urban <urbanc@in.tum.de> parents: 482 diff changeset	317
faba5360372c updated Christian Urban <urbanc@in.tum.de> parents: 482 diff changeset	318	\noindent
487 ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	319	This depth-first way of exploration seems to work quite efficiently in
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	320	many examples and is much less of a strain on memory. The problem is
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	321	that the backtracking can get ``catastrophic'' in some
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	322	examples---remember the catastrophic backtracking from earlier
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	323	lectures. This depth-first search with backtracking is the reason for
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	324	the abysmal performance of some regular expression matchers in Java,
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	325	Ruby and Python. I would like to show you this in the next two sections.
482 74149519e436 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	326
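To try both search strategies on the NFA from the beginning of this
section, the pieces can be assembled into a small stand-alone
program. This is only a sketch under some assumptions: the object name
\texttt{NfaSketch} is made up, the states are fixed instead of being
type parameters, and \texttt{next} is written as a total function with
a default case rather than with the partial functions used in
Figure~\ref{nfa}.

{\small\begin{lstlisting}[language=Scala]
object NfaSketch {
  sealed trait State
  case object Q0 extends State
  case object Q1 extends State
  case object Q2 extends State

  val starts: Set[State] = Set(Q0)
  def fins(q: State): Boolean = q == Q2

  // transitions of the example NFA; any other pair has no successor
  def next(q: State, c: Char): Set[State] = (q, c) match {
    case (Q0, 'a') => Set(Q1, Q2)
    case (Q0, 'b') => Set(Q0)
    case (Q1, 'a') => Set(Q1, Q2)
    case (Q1, 'b') => Set(Q0, Q1)
    case _         => Set()
  }

  // breadth-first: carry along the set of all states reached so far
  def accepts(s: List[Char]): Boolean =
    s.foldLeft(starts)((qs, c) => qs.flatMap(next(_, c))).exists(fins)

  // depth-first: follow one successor at a time, backtrack on failure
  def search(q: State, s: List[Char]): Boolean = s match {
    case Nil     => fins(q)
    case c :: cs => next(q, c).exists(search(_, cs))
  }
  def accepts2(s: List[Char]): Boolean = starts.exists(search(_, s))

  def main(args: Array[String]): Unit =
    for (w <- List("a", "ba", "ab", "bb"))
      println(w + ": " + accepts(w.toList) + " " + accepts2(w.toList))
}
\end{lstlisting}}

\noindent
Both functions give the same answer on every input; they differ only
in how the search is organised. On the four test strings, \texttt{a}
and \texttt{ba} are accepted, while \texttt{ab} and \texttt{bb} are
rejected.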
268 18bef085a7ca updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 251 diff changeset	327
490 8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	328	\subsection*{Epsilon NFAs}
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	329
485 21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	330	In order to get an idea of what calculations are performed by Java \&
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	331	friends, we need a method for transforming a regular expression into
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	332	an automaton. This automaton should accept exactly those strings that
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	333	are accepted by the regular expression. The simplest and most
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	334	well-known method for this is called the \emph{Thompson Construction},
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	335	after the Turing Award winner Ken Thompson. This method is by
487 ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	336	recursion over regular expressions and depends on the non-determinism
488 057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	337	in NFAs described in the previous section. You will see shortly why
487 ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	338	this construction works well with NFAs, but is not so straightforward
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	339	with DFAs.
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	340
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	341	Unfortunately we are still one step away from our intended target
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	342	though---because this construction uses a version of NFAs that allows
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	343	``silent transitions''. The idea behind silent transitions is that
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	344	they allow us to go from one state to the next without having to
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	345	consume a character. We label such silent transitions with the letter
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	346	$\epsilon$ and call the automata $\epsilon$NFAs. Two typical examples
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	347	of $\epsilon$NFAs are:
484 8182eb3278e0 updated Christian Urban <urbanc@in.tum.de> parents: 483 diff changeset	348
8182eb3278e0 updated Christian Urban <urbanc@in.tum.de> parents: 483 diff changeset	349
485 21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	350	\begin{center}
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	351	\begin{tabular}[t]{c@{\hspace{9mm}}c}
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	352	\begin{tikzpicture}[>=stealth',very thick,
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	353	every state/.style={minimum size=0pt,draw=blue!50,very thick,fill=blue!20},]
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	354	\node[state,initial] (Q_0) {$Q_0$};
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	355	\node[state] (Q_1) [above=of Q_0] {$Q_1$};
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	356	\node[state, accepting] (Q_2) [below=of Q_0] {$Q_2$};
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	357	\path[->] (Q_0) edge node [left] {$\epsilon$} (Q_1);
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	358	\path[->] (Q_0) edge node [left] {$\epsilon$} (Q_2);
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	359	\path[->] (Q_0) edge [loop right] node {$a$} ();
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	360	\path[->] (Q_1) edge [loop right] node {$a$} ();
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	361	\path[->] (Q_2) edge [loop right] node {$b$} ();
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	362	\end{tikzpicture} &
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	363
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	364	\raisebox{20mm}{
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	365	\begin{tikzpicture}[>=stealth',very thick,
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	366	every state/.style={minimum size=0pt,draw=blue!50,very thick,fill=blue!20},]
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	367	\node[state,initial] (r_1) {$R_1$};
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	368	\node[state] (r_2) [above=of r_1] {$R_2$};
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	369	\node[state, accepting] (r_3) [right=of r_1] {$R_3$};
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	370	\path[->] (r_1) edge node [below] {$b$} (r_3);
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	371	\path[->] (r_2) edge [bend left] node [above] {$a$} (r_3);
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	372	\path[->] (r_1) edge [bend left] node [left] {$\epsilon$} (r_2);
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	373	\path[->] (r_2) edge [bend left] node [right] {$a$} (r_1);
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	374	\end{tikzpicture}}
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	375	\end{tabular}
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	376	\end{center}
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	377
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	378	\noindent
487 ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	379	Consider the $\epsilon$NFA on the left-hand side: the
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	380	$\epsilon$-transitions mean you do not have to ``consume'' any part of
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	381	the input string, but ``silently'' change to a different state. In
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	382	this example, if you are in the starting state $Q_0$, you can silently
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	383	move either to $Q_1$ or $Q_2$. You can see that once you are in $Q_1$,
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	384	respectively $Q_2$, you cannot ``go back'' to the other states. So it
490 8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	385	seems allowing $\epsilon$-transitions is a rather substantial
487 ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	386	extension to NFAs. On first appearances, $\epsilon$-transitions might
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	387	even look rather strange, or even silly. To start with, silent
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	388	transitions make the decision whether a string is accepted by an
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	389	automaton even harder: with $\epsilon$NFAs we have to check whether we
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	390	can first do some $\epsilon$-transitions and then do a
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	391	``proper'' transition; and after any ``proper'' transition we again
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	392	have to check whether we can do some more silent transitions. Even
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	393	worse, if there is a silent transition pointing back to the same
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	394	state, then we have to be careful our decision procedure for strings
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	395	does not loop (remember the depth-first search for exploring all
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	396	states).
485 21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	397
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	398	The obvious question is: Do we get anything in return for this hassle
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	399	with silent transitions? Well, we still have to work for it\ldots
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	400	unfortunately. If we were to follow the many textbooks on the
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	401	subject, we would now start with defining what $\epsilon$NFAs
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	402	are---that would require extending the transition relation of
490 8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	403	NFAs. Next, we would show that the $\epsilon$NFAs are equivalent to
488 057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	404	NFAs and so on. Once we have done all this on paper, we would need to
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	405	implement $\epsilon$NFAs. Let's try to take a shortcut instead. We are
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	406	not really interested in $\epsilon$NFAs; they are only a convenient
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	407	tool for translating regular expressions into automata. So we are not
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	408	going to implement them explicitly, but translate them immediately
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	409	into NFAs (in a sense $\epsilon$NFAs are just a convenient API for
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	410	lazy people ;o). How does the translation work? Well, we have to find
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	411	all transitions of the form
485 21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	412
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	413	\[
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	414	q\stackrel{\epsilon}{\longrightarrow}\ldots\stackrel{\epsilon}{\longrightarrow}
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	415	\;\stackrel{a}{\longrightarrow}\;
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	416	\stackrel{\epsilon}{\longrightarrow}\ldots\stackrel{\epsilon}{\longrightarrow} q'
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	417	\]
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	418
492 882d5de18adc updated Christian Urban <urbanc@in.tum.de> parents: 491 diff changeset	419	\noindent where somewhere in the ``middle'' is an $a$-transition. We
882d5de18adc updated Christian Urban <urbanc@in.tum.de> parents: 491 diff changeset	420	replace them with $q \stackrel{a}{\longrightarrow} q'$. Doing this to
882d5de18adc updated Christian Urban <urbanc@in.tum.de> parents: 491 diff changeset	421	the $\epsilon$NFA on the right-hand side above gives the NFA
485 21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	422
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	423	\begin{center}
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	424	\begin{tikzpicture}[>=stealth',very thick,
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	425	every state/.style={minimum size=0pt,draw=blue!50,very thick,fill=blue!20},]
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	426	\node[state,initial] (r_1) {$R_1$};
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	427	\node[state] (r_2) [above=of r_1] {$R_2$};
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	428	\node[state, accepting] (r_3) [right=of r_1] {$R_3$};
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	429	\path[->] (r_1) edge node [above] {$b$} (r_3);
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	430	\path[->] (r_2) edge [bend left] node [above] {$a$} (r_3);
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	431	\path[->] (r_1) edge [bend left] node [left] {$a$} (r_2);
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	432	\path[->] (r_2) edge [bend left] node [right] {$a$} (r_1);
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	433	\path[->] (r_1) edge [loop below] node {$a$} ();
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	434	\path[->] (r_1) edge [bend right] node [below] {$a$} (r_3);
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	435	\end{tikzpicture}
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	436	\end{center}
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	437
487 ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	438	\noindent where the single $\epsilon$-transition is replaced by
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	439	three additional $a$-transitions. Please do the calculations yourself
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	440	and verify that I did not forget any transition.
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	441
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	442	So in what follows, whenever we are given an $\epsilon$NFA we will
488 057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	443	replace it by an equivalent NFA. The Scala code for this translation
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	444	is given in Figure~\ref{enfa}. The main workhorse in this code is a
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	445	function that calculates a fixpoint of a function (Lines 5--10). This
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	446	function is used for ``discovering'' which states are reachable by
487 ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	447	$\epsilon$-transitions. Once no new state is discovered, a fixpoint is
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	448	reached. This is used for example when calculating the starting states
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	449	of an equivalent NFA (see Line 36): we start with all starting states
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	450	of the $\epsilon$NFA and then look for all additional states that can
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	451	be reached by $\epsilon$-transitions. We keep on doing this until no
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	452	new state can be reached. This is what the $\epsilon$-closure, named
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	453	in the code \texttt{ecl}, calculates. Similarly, a state of the NFA is
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	454	accepting when we can reach an accepting state of the $\epsilon$NFA
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	455	from it by $\epsilon$-transitions.
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	456
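As a rough illustration of this fixpoint idea, here is a stand-alone
sketch for the $\epsilon$NFA on the right-hand side above. It is not
the code of Figure~\ref{enfa}: the names \texttt{EclSketch},
\texttt{step}, \texttt{fixp}, \texttt{enext} and \texttt{nfaNext} are
assumptions made for this example, and the states are fixed rather
than generic.

{\small\begin{lstlisting}[language=Scala]
import scala.annotation.tailrec

object EclSketch {
  sealed trait State
  case object R1 extends State
  case object R2 extends State
  case object R3 extends State

  // transitions of the right-hand epsilon-NFA; None is a silent step
  def step(q: State, c: Option[Char]): Set[State] = (q, c) match {
    case (R1, Some('b')) => Set(R3)
    case (R1, None)      => Set(R2)
    case (R2, Some('a')) => Set(R1, R3)
    case _               => Set()
  }

  // apply f repeatedly until the set of states stops changing
  @tailrec
  def fixp(f: Set[State] => Set[State], qs: Set[State]): Set[State] = {
    val qs2 = f(qs)
    if (qs2 == qs) qs else fixp(f, qs2)
  }

  // one round of silent steps, keeping the states we already have
  def enext(qs: Set[State]): Set[State] =
    qs ++ qs.flatMap(step(_, None))

  // epsilon-closure: all states reachable by zero or more silent steps
  def ecl(qs: Set[State]): Set[State] = fixp(enext, qs)

  // NFA transition: silent steps, then one c-step, then silent steps
  def nfaNext(q: State, c: Char): Set[State] =
    ecl(ecl(Set(q)).flatMap(step(_, Some(c))))

  def main(args: Array[String]): Unit = {
    println(ecl(Set(R1)))      // contains R1 and R2
    println(nfaNext(R1, 'a'))  // contains R1, R2 and R3
  }
}
\end{lstlisting}}

\noindent
The second result corresponds to the three $a$-transitions leaving
$R_1$ in the NFA drawn above. Calling \texttt{nfaNext} on the other
states is one way of checking the remaining transitions.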
485 21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	457
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	458	\begin{figure}[p]
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	459	\small
490 8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	460	\lstinputlisting[numbers=left]{../progs/display/enfa.scala}
485 21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	461
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	462	\caption{A Scala function that translates $\epsilon$NFAs into NFAs. The
490 8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	463	transition function of $\epsilon$NFAs takes as input an \texttt{Option[C]}.
485 21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	464	\texttt{None} stands for an $\epsilon$-transition; \texttt{Some(c)}
488 057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	465	for a ``proper'' transition consuming a character. The functions in
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	466	Lines 18--26 calculate
485 21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	467	all states reachable by one or more $\epsilon$-transitions for a given
491 7a0182c66403 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	468	set of states. The NFA is constructed in Lines 36--38.
7a0182c66403 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	469	Note the interesting commands in Lines 5 and 6: their purpose is
7a0182c66403 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	470	to ensure that \texttt{fixpT} is the tail-recursive version of
7a0182c66403 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	471	the fixpoint construction; otherwise we would quickly get a
7a0182c66403 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	472	stack-overflow exception, even on small examples, due to limitations
7a0182c66403 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	473	of the JVM.
7a0182c66403 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	474	\label{enfa}}
485 21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	475	\end{figure}
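
\noindent
To see the fixpoint idea behind Figure~\ref{enfa} in isolation, the
following is only a minimal sketch: the names \texttt{fixp} and
\texttt{ecl} as well as the example edges are assumptions made for
illustration and need not agree with the code in the figure. The
annotation \texttt{@tailrec} asks the Scala compiler to check that the
recursive call is in tail position, so that it can be compiled into a
loop. Note that in this sketch the closure also contains the states we
started with.

{\small\begin{lstlisting}[language=Scala]
import scala.annotation.tailrec

// applies f again and again until the result no longer changes
def fixp[A](f: A => A)(x: A): A = {
  @tailrec
  def go(cur: A): A = {
    val nxt = f(cur)
    if (nxt == cur) cur else go(nxt)
  }
  go(x)
}

// epsilon-closure: the states in qs together with all states
// reachable from them by silent steps only
def ecl[S](eps: S => Set[S])(qs: Set[S]): Set[S] =
  fixp[Set[S]](ps => ps ++ ps.flatMap(eps))(qs)

// hypothetical epsilon-edges: 0 -> 1, 0 -> 2 and 1 -> 3
val edges = Map(0 -> Set(1, 2), 1 -> Set(3))
val eps = (q: Int) => edges.getOrElse(q, Set.empty[Int])

// the closure of {0} consists of the states 0, 1, 2 and 3
val closure = ecl(eps)(Set(0))
\end{lstlisting}}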
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	476
487 ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	477	Also look carefully at how the transitions of $\epsilon$NFAs are
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	478	implemented. The additional possibility of performing silent
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	479	transitions is encoded by using \texttt{Option[C]} as the type for the
490 8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	480	``input''. The \texttt{Some}s stand for ``proper'' transitions where
487 ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	481	a character is consumed; \texttt{None} stands for
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	482	$\epsilon$-transitions. The transition functions for the two
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	483	$\epsilon$NFAs from the beginning of this section can be defined as
485 21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	484
490 8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	485	{\small\begin{lstlisting}[language=Scala]
487 ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	486	val enfa_trans1 : (State, Option[Char]) :=> Set[State] =
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	487	{ case (Q0, Some('a')) => Set(Q0)
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	488	case (Q0, None) => Set(Q1, Q2)
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	489	case (Q1, Some('a')) => Set(Q1)
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	490	case (Q2, Some('b')) => Set(Q2) }
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	491
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	492	val enfa_trans2 : (State, Option[Char]) :=> Set[State] =
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	493	{ case (R1, Some('b')) => Set(R3)
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	494	case (R1, None) => Set(R2)
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	495	case (R2, Some('a')) => Set(R1, R3) }
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	496	\end{lstlisting}}
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	497
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	498	\noindent
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	499	I hope you agree now with my earlier statement that the $\epsilon$NFAs
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	500	are just an API for NFAs.
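
To get a feeling for this encoding, it might help to query such a
transition function directly. The sketch below repeats
\texttt{enfa\_trans1} so that it can be run on its own. The definitions
of the states and of the type abbreviation \texttt{:=>} for partial
functions are assumptions here, and the helper \texttt{next} is made up
for illustration: it reads every case not covered by the partial
function as ``no transition''.

{\small\begin{lstlisting}[language=Scala]
// assumed definitions of the states and of the :=> abbreviation
abstract class State
case object Q0 extends State
case object Q1 extends State
case object Q2 extends State

type :=>[A, B] = PartialFunction[A, B]

val enfa_trans1 : (State, Option[Char]) :=> Set[State] =
  { case (Q0, Some('a')) => Set(Q0)
    case (Q0, None) => Set(Q1, Q2)
    case (Q1, Some('a')) => Set(Q1)
    case (Q2, Some('b')) => Set(Q2) }

// successors of q under the given input; undefined cases
// of the partial function yield the empty set
def next(q: State, in: Option[Char]): Set[State] =
  enfa_trans1.applyOrElse((q, in),
    (_: (State, Option[Char])) => Set[State]())

val proper = next(Q0, Some('a'))  // Set(Q0): consumes the character a
val silent = next(Q0, None)       // Set(Q1, Q2): the epsilon-transitions
val stuck  = next(Q1, None)       // empty: Q1 has no epsilon-transition
\end{lstlisting}}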
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	501
490 8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	502	\subsection*{Thompson Construction}
487 ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	503
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	504	Having the translation of $\epsilon$NFAs to NFAs in place, we can
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	505	finally return to the problem of translating regular expressions into
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	506	equivalent NFAs. Recall that by equivalent we mean that the NFAs
485 21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	507	recognise the same language. Consider the simple regular expressions
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	508	$\ZERO$, $\ONE$ and $c$. They can be translated into equivalent NFAs
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	509	as follows:
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	510
488 057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	511	\begin{equation}\mbox{
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	512	\begin{tabular}[t]{l@{\hspace{10mm}}l}
444 3056a4c071b0 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 349 diff changeset	513	\raisebox{1mm}{$\ZERO$} &
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	514	\begin{tikzpicture}[scale=0.7,>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
482 74149519e436 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	515	\node[state, initial] (Q_0) {$\mbox{}$};
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	516	\end{tikzpicture}\\\\
444 3056a4c071b0 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 349 diff changeset	517	\raisebox{1mm}{$\ONE$} &
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	518	\begin{tikzpicture}[scale=0.7,>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
482 74149519e436 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	519	\node[state, initial, accepting] (Q_0) {$\mbox{}$};
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	520	\end{tikzpicture}\\\\
487 ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	521	\raisebox{3mm}{$c$} &
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	522	\begin{tikzpicture}[scale=0.7,>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
482 74149519e436 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	523	\node[state, initial] (Q_0) {$\mbox{}$};
74149519e436 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	524	\node[state, accepting] (Q_1) [right=of Q_0] {$\mbox{}$};
74149519e436 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	525	\path[->] (Q_0) edge node [below] {$c$} (Q_1);
487 ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	526	\end{tikzpicture}\\
488 057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	527	\end{tabular}}\label{simplecases}
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	528	\end{equation}
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	529
487 ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	530	\noindent
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	531	I leave it to you to check that the NFAs match exactly those strings the
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	532	regular expressions can match. To do this translation in code we need
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	533	a way to construct states programmatically...and as an additional
488 057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	534	constraint Scala needs to recognise that these states are distinct.
487 ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	535	For this I implemented in Figure~\ref{thompson1} a class
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	536	\texttt{TState} that includes a counter and a companion object that
488 057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	537	increases this counter whenever a new state is created.\footnote{You might
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	538	have to read up on what \emph{companion objects} do in Scala.}
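
\noindent
In case the idea of a counter living in a companion object is new to
you, here is a stripped-down sketch of it. The field name \texttt{id}
and the variable \texttt{counter} are assumptions for illustration; the
version actually used is the one in Figure~\ref{thompson1}.

{\small\begin{lstlisting}[language=Scala]
// every state remembers the number it was created with
case class TState(id: Int)

object TState {
  private var counter = 0
  // hands out a fresh state by first increasing the counter
  def apply(): TState = {
    counter += 1
    new TState(counter)
  }
}

val q1 = TState()
val q2 = TState()
// q1 and q2 carry different numbers, so Scala treats them as distinct
val distinct = q1 != q2
\end{lstlisting}}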
487 ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	539
485 21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	540	\begin{figure}[p]
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	541	\small
490 8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	542	\lstinputlisting[numbers=left]{../progs/display/thompson1.scala}
487 ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	543	\caption{The first part of the Thompson Construction. Lines 7--16
488 057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	544	implement a way to create new states that are all
487 ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	545	distinct by virtue of a counter. This counter is
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	546	increased in the companion object of \texttt{TState}
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	547	whenever a new state is created. The code in Lines 24--40
488 057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	548	constructs NFAs for the simple regular expressions $\ZERO$, $\ONE$ and $c$.
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	549	Compare the pictures given in \eqref{simplecases}.
487 ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	550	\label{thompson1}}
485 21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	551	\end{figure}
21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	552
487 ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	553	\begin{figure}[p]
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	554	\small
490 8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	555	\lstinputlisting[numbers=left]{../progs/display/thompson2.scala}
487 ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	556	\caption{The second part of the Thompson Construction implementing
490 8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	557	the composition of NFAs according to $\cdot$, $+$ and ${}^*$.
487 ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	558	The implicit class for rich partial functions
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	559	implements the infix operation \texttt{+++} which
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	560	combines an $\epsilon$NFA transition with an NFA transition
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	561	(both given as partial functions).\label{thompson2}}
ffbc65112d48 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	562	\end{figure}
485 21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	563
488 057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	564	The case for the sequence regular expression $r_1 \cdot r_2$ is a bit more
489 4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	565	complicated: Say, we are given by recursion two NFAs representing the regular
488 057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	566	expressions $r_1$ and $r_2$ respectively.
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	567
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	568	\begin{center}
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	569	\begin{tikzpicture}[node distance=3mm,
488 057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	570	>=stealth',very thick,
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	571	every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
482 74149519e436 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	572	\node[state, initial] (Q_0) {$\mbox{}$};
488 057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	573	\node[state, initial] (Q_01) [below=1mm of Q_0] {$\mbox{}$};
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	574	\node[state, initial] (Q_02) [above=1mm of Q_0] {$\mbox{}$};
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	575	\node (R_1) [right=of Q_0] {$\ldots$};
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	576	\node[state, accepting] (T_1) [right=of R_1] {$\mbox{}$};
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	577	\node[state, accepting] (T_2) [above=of T_1] {$\mbox{}$};
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	578	\node[state, accepting] (T_3) [below=of T_1] {$\mbox{}$};
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	579
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	580	\node (A_0) [right=2.5cm of T_1] {$\mbox{}$};
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	581	\node[state, initial] (A_01) [above=1mm of A_0] {$\mbox{}$};
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	582	\node[state, initial] (A_02) [below=1mm of A_0] {$\mbox{}$};
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	583
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	584	\node (b_1) [right=of A_0] {$\ldots$};
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	585	\node[state, accepting] (c_1) [right=of b_1] {$\mbox{}$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	586	\node[state, accepting] (c_2) [above=of c_1] {$\mbox{}$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	587	\node[state, accepting] (c_3) [below=of c_1] {$\mbox{}$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	588	\begin{pgfonlayer}{background}
488 057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	589	\node (1) [rounded corners, inner sep=1mm, thick,
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	590	draw=black!60, fill=black!20, fit= (Q_0) (R_1) (T_1) (T_2) (T_3)] {};
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	591	\node (2) [rounded corners, inner sep=1mm, thick,
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	592	draw=black!60, fill=black!20, fit= (A_0) (b_1) (c_1) (c_2) (c_3)] {};
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	593	\node [yshift=2mm] at (1.north) {$r_1$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	594	\node [yshift=2mm] at (2.north) {$r_2$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	595	\end{pgfonlayer}
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	596	\end{tikzpicture}
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	597	\end{center}
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	598
488 057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	599	\noindent The first NFA has some accepting states and the second some
489 4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	600	starting states. We obtain an $\epsilon$NFA for $r_1\cdot r_2$ by
4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	601	connecting the accepting states of the first NFA with
4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	602	$\epsilon$-transitions to the starting states of the second
4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	603	automaton. In doing so, we make the accepting states of the first NFA
4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	604	non-accepting, like so:
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	605
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	606	\begin{center}
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	607	\begin{tikzpicture}[node distance=3mm,
488 057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	608	>=stealth',very thick,
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	609	every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
482 74149519e436 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	610	\node[state, initial] (Q_0) {$\mbox{}$};
488 057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	611	\node[state, initial] (Q_01) [below=1mm of Q_0] {$\mbox{}$};
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	612	\node[state, initial] (Q_02) [above=1mm of Q_0] {$\mbox{}$};
482 74149519e436 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	613	\node (r_1) [right=of Q_0] {$\ldots$};
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	614	\node[state] (t_1) [right=of r_1] {$\mbox{}$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	615	\node[state] (t_2) [above=of t_1] {$\mbox{}$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	616	\node[state] (t_3) [below=of t_1] {$\mbox{}$};
488 057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	617
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	618	\node (A_0) [right=2.5cm of t_1] {$\mbox{}$};
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	619	\node[state] (A_01) [above=1mm of A_0] {$\mbox{}$};
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	620	\node[state] (A_02) [below=1mm of A_0] {$\mbox{}$};
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	621
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	622	\node (b_1) [right=of A_0] {$\ldots$};
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	623	\node[state, accepting] (c_1) [right=of b_1] {$\mbox{}$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	624	\node[state, accepting] (c_2) [above=of c_1] {$\mbox{}$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	625	\node[state, accepting] (c_3) [below=of c_1] {$\mbox{}$};
488 057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	626	\path[->] (t_1) edge (A_01);
492 882d5de18adc updated Christian Urban <urbanc@in.tum.de> parents: 491 diff changeset	627	\path[->] (t_2) edge node [above] {$\epsilon$s} (A_01);
488 057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	628	\path[->] (t_3) edge (A_01);
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	629	\path[->] (t_1) edge (A_02);
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	630	\path[->] (t_2) edge (A_02);
492 882d5de18adc updated Christian Urban <urbanc@in.tum.de> parents: 491 diff changeset	631	\path[->] (t_3) edge node [below] {$\epsilon$s} (A_02);
488 057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	632
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	633	\begin{pgfonlayer}{background}
488 057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	634	\node (3) [rounded corners, inner sep=1mm, thick,
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	635	draw=black!60, fill=black!20, fit= (Q_0) (c_1) (c_2) (c_3)] {};
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	636	\node [yshift=2mm] at (3.north) {$r_1\cdot r_2$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	637	\end{pgfonlayer}
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	638	\end{tikzpicture}
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	639	\end{center}
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	640
489 4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	641	\noindent The idea behind this construction is that the start of any
4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	642	string is first recognised by the first NFA, then we silently change
4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	643	to the second NFA; the ending of the string is recognised by the
4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	644	second NFA...just like the matching of a string by the regular expression
490 8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	645	$r_1\cdot r_2$. The Scala code for this construction is given in
489 4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	646	Figure~\ref{thompson2} in Lines 16--23. The starting states of the
4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	647	$\epsilon$NFA are the starting states of the first NFA (corresponding
4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	648	to $r_1$); the accepting function is the accepting function of the
4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	649	second NFA (corresponding to $r_2$). The new transition function is
4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	650	all the ``old'' transitions plus the $\epsilon$-transitions connecting
4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	651	the accepting states of the first NFA to the starting states of the
490 8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	652	second NFA (Lines 18 and 19). The $\epsilon$NFA is then immediately
489 4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	653	translated into an NFA.
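
The effect of this construction can also be tried out with a much
rougher sketch, which is not the code from Figure~\ref{thompson2}:
instead of first building an $\epsilon$NFA and then translating it, the
function below takes the silent step immediately whenever an accepting
state of the first NFA is reached. The class \texttt{NFA}, the function
\texttt{nfaSeq} and the use of integers as states are assumptions for
illustration. The sketch also assumes that the two NFAs have distinct
states and that each accepting function is only true for states of its
own NFA.

{\small\begin{lstlisting}[language=Scala]
type :=>[A, B] = PartialFunction[A, B]

case class NFA[A](starts: Set[A],
                  delta: (A, Char) :=> Set[A],
                  fins: A => Boolean) {
  // successors of a single state; undefined cases mean no transition
  def next(q: A, c: Char): Set[A] =
    delta.applyOrElse((q, c), (_: (A, Char)) => Set[A]())
  def accepts(s: String): Boolean =
    s.foldLeft(starts)((qs, c) => qs.flatMap(next(_, c))).exists(fins)
}

def nfaSeq[A](n1: NFA[A], n2: NFA[A]): NFA[A] = {
  // adds the starting states of n2 as soon as n1 could accept
  def jump(qs: Set[A]): Set[A] =
    if (qs.exists(n1.fins)) qs ++ n2.starts else qs
  NFA[A](jump(n1.starts),
         { case (q, c) => jump(n1.next(q, c) ++ n2.next(q, c)) },
         n2.fins)
}

// NFAs for the characters a and b, using the distinct states 0..3
val nfaA = NFA[Int](Set(0), { case (0, 'a') => Set(1) }, _ == 1)
val nfaB = NFA[Int](Set(2), { case (2, 'b') => Set(3) }, _ == 3)
val nfaAB = nfaSeq(nfaA, nfaB)

val yes = nfaAB.accepts("ab")   // true
val no  = nfaAB.accepts("a")    // false: the b is still missing
\end{lstlisting}}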
4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	654
4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	655
490 8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	656	The case for the alternative regular expression $r_1 + r_2$ is
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	657	slightly different: We are given by recursion two NFAs representing
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	658	$r_1$ and $r_2$ respectively. Each NFA has some starting states and
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	659	some accepting states. We obtain an NFA for the regular expression $r_1
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	660	+ r_2$ by composing the transition functions (this crucially depends
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	661	on knowing that the states of each component NFA are distinct) and
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	662	by combining the starting states and accepting functions.
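
Why the distinctness of the states matters can be seen in a small
experiment. The sketch below assumes that transition functions are
partial functions, abbreviated by \texttt{:=>}, and that they are
composed with Scala's \texttt{orElse}; the concrete functions and the
integer states are made up for illustration. Since \texttt{orElse} only
consults the second function where the first one is undefined, a
transition of the second NFA would be lost if both NFAs shared a state.

{\small\begin{lstlisting}[language=Scala]
type :=>[A, B] = PartialFunction[A, B]

// hypothetical transition functions of two NFAs with distinct states
val delta1 : (Int, Char) :=> Set[Int] = { case (0, 'a') => Set(1) }
val delta2 : (Int, Char) :=> Set[Int] = { case (2, 'b') => Set(3) }

// the combined NFA has the transitions of both components ...
val delta = delta1 orElse delta2
// ... and also their starting states and accepting states
val starts = Set(0) ++ Set(2)
val fins = (q: Int) => q == 1 || q == 3

// if both components used the state 0, the transition of the
// second function for (0, 'a') would never be looked at
val clash1 : (Int, Char) :=> Set[Int] = { case (0, 'a') => Set(1) }
val clash2 : (Int, Char) :=> Set[Int] = { case (0, 'a') => Set(5) }
val lost = (clash1 orElse clash2)((0, 'a'))   // Set(1), the 5 is lost
\end{lstlisting}}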
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	663
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	664	\begin{center}
490 8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	665	\begin{tabular}[t]{ccc}
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	666	\begin{tikzpicture}[node distance=3mm,
488 057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	667	>=stealth',very thick,
490 8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	668	every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	669	baseline=(current bounding box.center)]
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	670	\node at (0,0) (1) {$\mbox{}$};
489 4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	671	\node (2) [above=10mm of 1] {};
4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	672	\node[state, initial] (4) [above=1mm of 2] {$\mbox{}$};
4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	673	\node[state, initial] (5) [below=1mm of 2] {$\mbox{}$};
4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	674	\node[state, initial] (3) [below=10mm of 1] {$\mbox{}$};
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	675
489 4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	676	\node (a) [right=of 2] {$\ldots\,$};
4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	677	\node (a1) [right=of a] {$$};
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	678	\node[state, accepting] (a2) [above=of a1] {$\mbox{}$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	679	\node[state, accepting] (a3) [below=of a1] {$\mbox{}$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	680
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	681	\node (b) [right=of 3] {$\ldots$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	682	\node[state, accepting] (b1) [right=of b] {$\mbox{}$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	683	\node[state, accepting] (b2) [above=of b1] {$\mbox{}$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	684	\node[state, accepting] (b3) [below=of b1] {$\mbox{}$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	685	\begin{pgfonlayer}{background}
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	686	\node (1) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (2) (a1) (a2) (a3)] {};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	687	\node (2) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (3) (b1) (b2) (b3)] {};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	688	\node [yshift=3mm] at (1.north) {$r_1$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	689	\node [yshift=3mm] at (2.north) {$r_2$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	690	\end{pgfonlayer}
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	691	\end{tikzpicture}
490 8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	692	&
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	693	\mbox{}\qquad\tikz{\draw[>=stealth,line width=2mm,->] (0,0) -- (1, 0)}\quad\mbox{}
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	694	&
489 4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	695	\begin{tikzpicture}[node distance=3mm,
4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	696	>=stealth',very thick,
490 8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	697	every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	698	baseline=(current bounding box.center)]
489 4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	699	\node at (0,0) (1) {$\mbox{}$};
4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	700	\node (2) [above=10mm of 1] {$$};
4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	701	\node[state, initial] (4) [above=1mm of 2] {$\mbox{}$};
4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	702	\node[state, initial] (5) [below=1mm of 2] {$\mbox{}$};
4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	703	\node[state, initial] (3) [below=10mm of 1] {$\mbox{}$};
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	704
489 4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	705	\node (a) [right=of 2] {$\ldots\,$};
4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	706	\node (a1) [right=of a] {$$};
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	707	\node[state, accepting] (a2) [above=of a1] {$\mbox{}$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	708	\node[state, accepting] (a3) [below=of a1] {$\mbox{}$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	709
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	710	\node (b) [right=of 3] {$\ldots$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	711	\node[state, accepting] (b1) [right=of b] {$\mbox{}$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	712	\node[state, accepting] (b2) [above=of b1] {$\mbox{}$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	713	\node[state, accepting] (b3) [below=of b1] {$\mbox{}$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	714
489 4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	715	%\path[->] (1) edge node [above] {$\epsilon$} (2);
4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	716	%\path[->] (1) edge node [below] {$\epsilon$} (3);
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	717
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	718	\begin{pgfonlayer}{background}
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	719	\node (3) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (1) (a2) (a3) (b2) (b3)] {};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	720	\node [yshift=3mm] at (3.north) {$r_1+ r_2$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	721	\end{pgfonlayer}
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	722	\end{tikzpicture}
490 8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	723	\end{tabular}
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	724	\end{center}
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	725
489 4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	726	\noindent The code for this construction is in Figure~\ref{thompson2}
490 8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	727	in Lines 25--33.
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	728
Finally for the $*$-case we have an NFA for $r$ and connect its
accepting states to a new starting state via
$\epsilon$-transitions. The new starting state is in turn connected
to the old starting state by an $\epsilon$-transition, and the old
accepting states cease to be accepting. The new starting state is also
an accepting state, because $r^*$ can recognise the empty string.
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	733
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	734	\begin{center}
490 8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	735	\begin{tabular}[b]{@{\hspace{-4mm}}ccc@{}}
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	736	\begin{tikzpicture}[node distance=3mm,
490 8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	737	>=stealth',very thick,
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	738	every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	739	baseline=(current bounding box.north)]
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	740	\node at (0,0) (1) {$\mbox{}$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	741	\node[state, initial] (2) [right=16mm of 1] {$\mbox{}$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	742	\node (a) [right=of 2] {$\ldots$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	743	\node[state, accepting] (a1) [right=of a] {$\mbox{}$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	744	\node[state, accepting] (a2) [above=of a1] {$\mbox{}$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	745	\node[state, accepting] (a3) [below=of a1] {$\mbox{}$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	746	\begin{pgfonlayer}{background}
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	747	\node (1) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (2) (a1) (a2) (a3)] {};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	748	\node [yshift=3mm] at (1.north) {$r$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	749	\end{pgfonlayer}
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	750	\end{tikzpicture}
490 8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	751	&
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	752	\raisebox{-16mm}{\;\tikz{\draw[>=stealth,line width=2mm,->] (0,0) -- (1, 0)}}
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	753	&
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	754	\begin{tikzpicture}[node distance=3mm,
489 4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	755	>=stealth',very thick,
490 8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	756	every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	757	baseline=(current bounding box.north)]
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	758	\node at (0,0) [state, initial,accepting] (1) {$\mbox{}$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	759	\node[state] (2) [right=16mm of 1] {$\mbox{}$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	760	\node (a) [right=of 2] {$\ldots$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	761	\node[state] (a1) [right=of a] {$\mbox{}$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	762	\node[state] (a2) [above=of a1] {$\mbox{}$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	763	\node[state] (a3) [below=of a1] {$\mbox{}$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	764	\path[->] (1) edge node [above] {$\epsilon$} (2);
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	765	\path[->] (a1) edge [bend left=45] node [above] {$\epsilon$} (1);
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	766	\path[->] (a2) edge [bend right] node [below] {$\epsilon$} (1);
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	767	\path[->] (a3) edge [bend left=45] node [below] {$\epsilon$} (1);
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	768	\begin{pgfonlayer}{background}
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	769	\node (2) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (1) (a2) (a3)] {};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	770	\node [yshift=3mm] at (2.north) {$r^*$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	771	\end{pgfonlayer}
490 8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	772	\end{tikzpicture}
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	773	\end{tabular}
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	774	\end{center}
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	775
490 8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	776	\noindent
The corresponding code is in Figure~\ref{thompson2} in Lines 35--43.
489 4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	778
To sum up, you can see in the sequence and star cases the need for
silent $\epsilon$-transitions. Similarly the alternative case
shows the need for the NFA-nondeterminism. It seems awkward to form the
`alternative' composition of two DFAs, because DFAs do not allow
several starting and successor states. All these constructions can now
be put together in the following recursive function:
4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	785
4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	786
490 8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	787	{\small\begin{lstlisting}[language=Scala]
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	788	def thompson(r: Rexp) : NFAt = r match {
488 057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	789	case ZERO => NFA_ZERO()
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	790	case ONE => NFA_ONE()
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	791	case CHAR(c) => NFA_CHAR(c)
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	792	case ALT(r1, r2) => NFA_ALT(thompson(r1), thompson(r2))
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	793	case SEQ(r1, r2) => NFA_SEQ(thompson(r1), thompson(r2))
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	794	case STAR(r1) => NFA_STAR(thompson(r1))
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	795	}
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	796	\end{lstlisting}}
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	797
489 4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	798	\noindent
It calculates an NFA from a regular expression. At last we can run
NFAs for our evil regular expression examples. The graph on the
left shows that when translating a regular expression such as
$a^{\{n\}}$ into an NFA, the size can blow up and then even the
relatively fast (on small examples) breadth-first search can be
slow. The graph on the right shows that with `evil' regular
expressions the depth-first search can be abysmally slow. Even if the
graphs do not completely overlap with the curves of Python, Ruby and
Java, they are similar enough. OK\ldots now you know why regular
expression matchers in those languages can be so slow.
489 4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	809
488 057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	810
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	811	\begin{center}
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	812	\begin{tabular}{@{\hspace{-1mm}}c@{\hspace{1mm}}c@{}}
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	813	\begin{tikzpicture}
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	814	\begin{axis}[
490 8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	815	title={Graph: $a^{?\{n\}} \cdot a^{\{n\}}$ and strings
489 4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	816	$\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
490 8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	817	title style={yshift=-2ex},
489 4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	818	xlabel={$n$},
4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	819	x label style={at={(1.05,0.0)}},
4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	820	ylabel={time in secs},
4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	821	enlargelimits=false,
4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	822	xtick={0,5,...,30},
4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	823	xmax=33,
4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	824	ymax=35,
4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	825	ytick={0,5,...,30},
4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	826	scaled ticks=false,
4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	827	axis lines=left,
4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	828	width=5.5cm,
490 8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	829	height=4cm,
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	830	legend entries={Python,Ruby, breadth-first NFA},
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	831	legend style={at={(0.5,-0.25)},anchor=north,font=\small},
489 4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	832	legend cell align=left]
4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	833	\addplot[blue,mark=*, mark options={fill=white}] table {re-python.data};
4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	834	\addplot[brown,mark=triangle*, mark options={fill=white}] table {re-ruby.data};
% breadth-first search in NFAs
4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	836	\addplot[red,mark=*, mark options={fill=white}] table {
4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	837	1 0.00586
4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	838	2 0.01209
4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	839	3 0.03076
4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	840	4 0.08269
4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	841	5 0.12881
4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	842	6 0.25146
4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	843	7 0.51377
4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	844	8 0.89079
4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	845	9 1.62802
4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	846	10 3.05326
4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	847	11 5.92437
4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	848	12 11.67863
4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	849	13 24.00568
4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	850	};
4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	851	\end{axis}
4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	852	\end{tikzpicture}
4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	853	&
4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	854	\begin{tikzpicture}
4430477595ec updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	855	\begin{axis}[
title={Graph: $(a^*)^* \cdot b$ and strings
488 057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	857	$\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
490 8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	858	title style={yshift=-2ex},
488 057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	859	xlabel={$n$},
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	860	x label style={at={(1.05,0.0)}},
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	861	ylabel={time in secs},
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	862	enlargelimits=false,
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	863	xtick={0,5,...,30},
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	864	xmax=33,
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	865	ymax=35,
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	866	ytick={0,5,...,30},
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	867	scaled ticks=false,
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	868	axis lines=left,
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	869	width=5.5cm,
490 8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	870	height=4cm,
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	871	legend entries={Python, Java, depth-first NFA},
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	872	legend style={at={(0.5,-0.25)},anchor=north,font=\small},
488 057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	873	legend cell align=left]
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	874	\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	875	\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	876	% depth-first search in NFAs
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	877	\addplot[red,mark=*, mark options={fill=white}] table {
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	878	1 0.00605
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	879	2 0.03086
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	880	3 0.11994
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	881	4 0.45389
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	882	5 2.06192
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	883	6 8.04894
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	884	7 32.63549
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	885	};
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	886	\end{axis}
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	887	\end{tikzpicture}
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	888	\end{tabular}
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	889	\end{center}
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	890
057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	891
268 18bef085a7ca updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 251 diff changeset	892
490 8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	893	\subsection*{Subset Construction}
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	894
Of course, some developers of regular expression matchers are aware of
these problems with sluggish NFAs and try to address them. In this
section I would like to show you one common technique for alleviating
the problem. This will also explain why we insisted on polymorphic types in
our DFA code (remember I used \texttt{A} and \texttt{C} for the types
of states and the input, see Figure~\ref{dfa} on
Page~\pageref{dfa}).\bigskip
268 18bef085a7ca updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 251 diff changeset	902
490 8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	903	\noindent
To start, remember that we did not bother with defining and
implementing $\epsilon$NFAs: we immediately translated them into
equivalent NFAs. Equivalent in the sense of accepting the same
language (though we only claimed this and did not prove it
rigorously). Remember also that NFAs have non-deterministic
transitions defined as a relation or implemented as a function returning
sets of states. This non-determinism is crucial for the Thompson
Construction to work (recall the cases for $\cdot$, $+$ and
${}^*$). But this non-determinism makes it harder with NFAs to decide
whether a string is accepted or not; whereas such a decision is rather
straightforward with DFAs: recall their transition function is a
\emph{function} that returns a single state. So with DFAs we do not
have to search at all. What is perhaps interesting is the fact that
for every NFA we can find a DFA that also recognises the same
language. This might sound a bit paradoxical: NFA $\rightarrow$
decision of acceptance hard; DFA $\rightarrow$ decision easy. But this
\emph{is} true\ldots but of course there is always a caveat: nothing
in life is ever for free.
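
\noindent
The difference can be made concrete with a small sketch. It is only an
illustration under simplifying assumptions: the names
\texttt{dfaAccepts}, \texttt{nfaAccepts} and the two example automata
are hypothetical, and the automata are passed as plain functions rather
than as the classes from the figures. With a DFA a single state is
threaded through the input; with an NFA a whole set of states has to be
tracked.

{\small\begin{lstlisting}[language=Scala]
// DFA: the transition function returns exactly one state,
// so running the automaton is a simple fold over the input
def dfaAccepts[A, C](start: A, delta: (A, C) => A,
                     fins: A => Boolean)(cs: List[C]): Boolean =
  fins(cs.foldLeft(start)(delta))

// NFA: the transition function returns a set of states,
// so we keep track of all states reachable so far
def nfaAccepts[A, C](starts: Set[A], delta: (A, C) => Set[A],
                     fins: A => Boolean)(cs: List[C]): Boolean =
  cs.foldLeft(starts)((qs, c) => qs.flatMap(q => delta(q, c)))
    .exists(fins)

// hypothetical example: strings over a and b that end in b
def dfaEndsInB(s: String): Boolean =
  dfaAccepts[Int, Char](0,
    (q, c) => if (c == 'b') 1 else 0,
    q => q == 1)(s.toList)

def nfaEndsInB(s: String): Boolean =
  nfaAccepts[Int, Char](Set(0),
    (q, c) => if (q == 0 && c == 'b') Set(0, 1)
              else if (q == 0) Set(0)
              else Set(),
    q => q == 1)(s.toList)
\end{lstlisting}}

\noindent
Both functions read the input once from left to right; the NFA version
pays for the non-determinism by carrying a set of states in each step.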
488 057b4603b940 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	922
There are actually a number of methods for transforming an NFA into
7a0182c66403 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	924	an equivalent DFA, but the most famous one is the \emph{subset
490 8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	925	construction}. Consider the following NFA where the states are
491 7a0182c66403 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	926	labelled with $0$, $1$ and $2$.
268 18bef085a7ca updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 251 diff changeset	927
18bef085a7ca updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 251 diff changeset	928	\begin{center}
18bef085a7ca updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 251 diff changeset	929	\begin{tabular}{c@{\hspace{10mm}}c}
18bef085a7ca updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 251 diff changeset	930	\begin{tikzpicture}[scale=0.7,>=stealth',very thick,
18bef085a7ca updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 251 diff changeset	931	every state/.style={minimum size=0pt,
18bef085a7ca updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 251 diff changeset	932	draw=blue!50,very thick,fill=blue!20},
490 8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	933	baseline=(current bounding box.center)]
482 74149519e436 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	934	\node[state,initial] (Q_0) {$0$};
490 8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	935	\node[state] (Q_1) [below=of Q_0] {$1$};
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	936	\node[state, accepting] (Q_2) [below=of Q_1] {$2$};
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	937
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	938	\path[->] (Q_0) edge node [right] {$b$} (Q_1);
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	939	\path[->] (Q_1) edge node [right] {$a,b$} (Q_2);
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	940	\path[->] (Q_0) edge [loop above] node {$a, b$} ();
268 18bef085a7ca updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 251 diff changeset	941	\end{tikzpicture}
18bef085a7ca updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 251 diff changeset	942	&
\begin{tabular}{r|ll}
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	944	states & $a$ & $b$\\
268 18bef085a7ca updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 251 diff changeset	945	\hline
344 408fd5994288 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 333 diff changeset	946	$\{\}\phantom{\star}$ & $\{\}$ & $\{\}$\\
490 8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	947	start: $\{0\}\phantom{\star}$ & $\{0\}$ & $\{0,1\}$\\
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	948	$\{1\}\phantom{\star}$ & $\{2\}$ & $\{2\}$\\
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	949	$\{2\}\star$ & $\{\}$ & $\{\}$\\
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	950	$\{0,1\}\phantom{\star}$ & $\{0,2\}$ & $\{0,1,2\}$\\
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	951	$\{0,2\}\star$ & $\{0\}$ & $\{0,1\}$\\
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	952	$\{1,2\}\star$ & $\{2\}$ & $\{2\}$\\
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	953	$\{0,1,2\}\star$ & $\{0,2\}$ & $\{0,1,2\}$\\
268 18bef085a7ca updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 251 diff changeset	954	\end{tabular}
18bef085a7ca updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 251 diff changeset	955	\end{tabular}
18bef085a7ca updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 251 diff changeset	956	\end{center}
18bef085a7ca updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 251 diff changeset	957
490 8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	958	\noindent The states of the corresponding DFA are given by generating
491 7a0182c66403 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	959	all subsets of the set $\{0,1,2\}$ (seen in the states column
490 8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	960	in the table on the right). The other columns define the transition
491 7a0182c66403 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	961	function for the DFA for inputs $a$ and $b$. The first row states that
490 8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	962	$\{\}$ is the sink state which has transitions for $a$ and $b$ to
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	963	itself. The next three lines are calculated as follows:
268 18bef085a7ca updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 251 diff changeset	964
18bef085a7ca updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 251 diff changeset	965	\begin{itemize}
490 8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	966	\item Suppose you calculate the entry for the $a$-transition for state
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	967	$\{0\}$. Look for all states in the NFA that can be reached by such
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	968	a transition from this state; this is only state $0$; therefore from
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	969	state $\{0\}$ we can go to state $\{0\}$ via an $a$-transition.
\item Do the same for the $b$-transition; you can reach states $0$ and
$1$ in the NFA; therefore in the DFA we can go from state $\{0\}$ to
state $\{0,1\}$ via a $b$-transition.
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	973	\item Continue with the states $\{1\}$ and $\{2\}$.
268 18bef085a7ca updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 251 diff changeset	974	\end{itemize}
18bef085a7ca updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 251 diff changeset	975
491 7a0182c66403 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	976	\noindent
Once you have filled in the transitions for the `simple' states $\{0\}$,
\ldots, $\{2\}$, you only have to build the union for the compound states
$\{0,1\}$, $\{0,2\}$ and so on. For example, for $\{0,1\}$ you take the
union of Line $\{0\}$ and Line $\{1\}$, which gives $\{0,2\}$ for $a$,
and $\{0,1,2\}$ for $b$. And so on.
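
\noindent
The row-by-row calculation can also be sketched in a few lines of
Scala. This is only an illustration under some assumptions: the NFA
from the example is encoded as a plain \texttt{Map} (which is not how
our NFA class represents it), and the names \texttt{nfaDelta},
\texttt{next}, \texttt{nexts} and \texttt{table} are chosen just for
this sketch.

{\small\begin{lstlisting}[language=Scala]
// NFA from the example; a missing entry means "no successor"
val nfaDelta: Map[(Int, Char), Set[Int]] = Map(
  (0, 'a') -> Set(0),
  (0, 'b') -> Set(0, 1),
  (1, 'a') -> Set(2),
  (1, 'b') -> Set(2)
)

// successors of a single NFA state: a row for a `simple' state
def next(q: Int, c: Char): Set[Int] =
  nfaDelta.getOrElse((q, c), Set[Int]())

// successors of a set of NFA states: the union of the simple rows
def nexts(qs: Set[Int], c: Char): Set[Int] =
  qs.flatMap(q => next(q, c))

// one table row (a-column, b-column) for every subset of {0, 1, 2}
val table: Map[Set[Int], (Set[Int], Set[Int])] =
  Set(0, 1, 2).subsets()
    .map(qs => (qs, (nexts(qs, 'a'), nexts(qs, 'b'))))
    .toMap
\end{lstlisting}}

\noindent
For instance \texttt{table(Set(0, 1))} evaluates to the pair
\texttt{(Set(0, 2), Set(0, 1, 2))}, which agrees with the row for
$\{0,1\}$ in the table above.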
344 408fd5994288 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 333 diff changeset	982
491 7a0182c66403 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	983	The starting state of the DFA can be calculated from the starting
7a0182c66403 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	984	states of the NFA, that is in this case $\{0\}$. But in general there
7a0182c66403 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	985	can of course be many starting states in the NFA and you would take
7a0182c66403 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	986	the corresponding subset as \emph{the} starting state of the DFA.
7a0182c66403 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	987
7a0182c66403 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	988	The accepting states in the DFA are given by all sets that contain a
$2$, which is the only accepting state in this NFA. But again in
7a0182c66403 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	990	general if the subset contains any accepting state from the NFA, then
7a0182c66403 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	991	the corresponding state in the DFA is accepting as well. This
7a0182c66403 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	992	completes the subset construction. The corresponding DFA for the NFA
7a0182c66403 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	993	shown above is:
7a0182c66403 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	994
7a0182c66403 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	995	\begin{equation}
490 8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	996	\begin{tikzpicture}[scale=0.8,>=stealth',very thick,
344 408fd5994288 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 333 diff changeset	997	every state/.style={minimum size=0pt,
491 7a0182c66403 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	998	draw=blue!50,very thick,fill=blue!20},
7a0182c66403 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	999	baseline=(current bounding box.center)]
490 8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1000	\node[state,initial] (q0) {$0$};
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1001	\node[state] (q01) [right=of q0] {$0,1$};
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1002	\node[state,accepting] (q02) [below=of q01] {$0,2$};
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1003	\node[state,accepting] (q012) [right=of q02] {$0,1,2$};
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1004	\node[state] (q1) [below=0.5cm of q0] {$1$};
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1005	\node[state,accepting] (q2) [below=1cm of q1] {$2$};
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1006	\node[state] (qn) [below left=1cm of q2] {$\{\}$};
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1007	\node[state,accepting] (q12) [below right=1cm of q2] {$1,2$};
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1008
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1009	\path[->] (q0) edge node [above] {$b$} (q01);
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1010	\path[->] (q01) edge node [above] {$b$} (q012);
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1011	\path[->] (q0) edge [loop above] node {$a$} ();
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1012	\path[->] (q012) edge [loop right] node {$b$} ();
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1013	\path[->] (q012) edge node [below] {$a$} (q02);
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1014	\path[->] (q02) edge node [below] {$a$} (q0);
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1015	\path[->] (q01) edge [bend left] node [left] {$a$} (q02);
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1016	\path[->] (q02) edge [bend left] node [right] {$b$} (q01);
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1017	\path[->] (q1) edge node [left] {$a,b$} (q2);
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1018	\path[->] (q12) edge node [right] {$a, b$} (q2);
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1019	\path[->] (q2) edge node [right] {$a, b$} (qn);
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1020	\path[->] (qn) edge [loop left] node {$a,b$} ();
491 7a0182c66403 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	1021	\end{tikzpicture}\label{subsetdfa}
7a0182c66403 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	1022	\end{equation}
490 8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1023
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1024	\noindent
Please check that this is indeed a DFA. The big question is whether
this DFA can recognise the same language as the NFA we started with.
I will let you ponder this question.
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1028
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1029
There are also two points to note: One is that very often in the
subset construction the resulting DFA contains a number of ``dead''
states that are never reachable from the starting state. This is
obvious in the example, where the states $\{1\}$, $\{2\}$, $\{1,2\}$ and
$\{\}$ can never be reached from the starting state. But this might
not always be as obvious as that. In effect the DFA in this example is
not a \emph{minimal} DFA (more about this in a minute). Such dead
states can be safely removed without changing the language that is
recognised by the DFA. The other point is that in some cases
the subset construction produces a DFA that does \emph{not} contain
any dead states and that is already minimal, so that all subsets are
needed. This in turn means that in some cases the number of states can
increase exponentially when going from NFAs to DFAs, namely from $n$
to $2^n$ (which is the number of subsets you can form for sets of $n$
states). This blow-up in the number of states in the DFA is again bad
news for how quickly you can decide whether a string is accepted by a
DFA or not. So the caveat with DFAs is that they might make the task of
finding the next state trivial, but might require up to $2^n$ states
where the NFA has only $n$.\bigskip
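
\noindent
Finding the states that are \emph{not} dead can be done by a simple
search from the starting state. The following is a minimal sketch,
assuming the DFA is given by a total transition function and a finite
alphabet; the names \texttt{reachable}, \texttt{trans} and
\texttt{step} are made up for this illustration and are not part of
our DFA class.

{\small\begin{lstlisting}[language=Scala]
import scala.annotation.tailrec

// all states reachable from start, found by a worklist search
def reachable[A, C](start: A, alphabet: Set[C],
                    delta: (A, C) => A): Set[A] = {
  @tailrec
  def go(todo: List[A], seen: Set[A]): Set[A] = todo match {
    case Nil => seen
    case q :: rest =>
      // successors of q that have not been seen before
      val fresh = alphabet.map(c => delta(q, c)) -- seen
      go(rest ++ fresh, seen ++ fresh)
  }
  go(List(start), Set(start))
}

// transitions of the example NFA; missing entries mean "no successor"
val trans: Map[(Int, Char), Set[Int]] = Map(
  (0, 'a') -> Set(0), (0, 'b') -> Set(0, 1),
  (1, 'a') -> Set(2), (1, 'b') -> Set(2)
)

// transition function of the DFA from the subset construction
val step: (Set[Int], Char) => Set[Int] =
  (qs, c) => qs.flatMap(q => trans.getOrElse((q, c), Set[Int]()))

val live = reachable(Set(0), Set('a', 'b'), step)
\end{lstlisting}}

\noindent
For the example, \texttt{live} contains only the four states $\{0\}$,
$\{0,1\}$, $\{0,2\}$ and $\{0,1,2\}$; the other four subsets are the
dead states mentioned above.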
490 8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1049
491 7a0182c66403 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	1050	\noindent
To conclude this section, how conveniently can we
implement the subset construction with our versions of NFAs and
DFAs? Very conveniently. The code is just:
490 8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1054
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1055	{\small\begin{lstlisting}[language=Scala]
def subset[A, C](nfa: NFA[A, C]) : DFA[Set[A], C] = {
  DFA(nfa.starts,                            // start: set of NFA starts
      { case (qs, c) => nfa.nexts(qs, c) },  // next: all NFA successors
      _.exists(nfa.fins))                    // accepting if one state is
}
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1061	\end{lstlisting}}
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1062
491 7a0182c66403 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	1063	\noindent
The interesting point in this code is that the state type of the
calculated DFA is \texttt{Set[A]}. Convince yourself that this works
out correctly.
490 8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1067
The DFA is then given by three components: the starting state, the
transition function and the accepting-states function. The starting
states are a set in the given NFA, but a single state in the DFA. The
transition function, given the state \texttt{qs} and input \texttt{c},
needs to produce the next state: this is the set of all NFA states
that are reachable from some state in \texttt{qs}. The function
\texttt{nexts} from the NFA class already calculates this for us. The
accepting-states function for the DFA is true whenever at least one
state in the subset is accepting (that is, \texttt{fins} is true for
it) in the NFA.\medskip
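
\noindent
To try out \texttt{subset} on the example NFA, one needs the NFA and
DFA classes. The sketch below uses simplified stand-ins for them:
their fields follow the names used in the \texttt{subset} code above
(\texttt{starts}, \texttt{nexts}, \texttt{fins}), but everything else,
including the use of \texttt{PartialFunction} and the
\texttt{accepts} method, is an assumption made only for this
illustration.

{\small\begin{lstlisting}[language=Scala]
// simplified stand-in for the DFA class (an assumption)
case class DFA[A, C](start: A,
                     delta: PartialFunction[(A, C), A],
                     fins: A => Boolean) {
  // run the DFA over the string and test the final state
  def accepts(s: List[C]): Boolean =
    fins(s.foldLeft(start)((q, c) => delta((q, c))))
}

// simplified stand-in for the NFA class (an assumption)
case class NFA[A, C](starts: Set[A],
                     delta: PartialFunction[(A, C), Set[A]],
                     fins: A => Boolean) {
  // successors of one state; no transition means the empty set
  def next(q: A, c: C): Set[A] =
    delta.lift((q, c)).getOrElse(Set[A]())
  // successors of a set of states
  def nexts(qs: Set[A], c: C): Set[A] = qs.flatMap(next(_, c))
}

def subset[A, C](nfa: NFA[A, C]): DFA[Set[A], C] =
  DFA[Set[A], C](nfa.starts,
                 { case (qs, c) => nfa.nexts(qs, c) },
                 _.exists(nfa.fins))

// the example NFA with states 0, 1 and 2
val nfa = NFA[Int, Char](
  Set(0),
  { case (0, 'a') => Set(0)
    case (0, 'b') => Set(0, 1)
    case (1, 'a') | (1, 'b') => Set(2) },
  _ == 2
)

val dfa = subset(nfa)
\end{lstlisting}}

\noindent
With these definitions \texttt{dfa.accepts("ba".toList)} and
\texttt{dfa.accepts("abb".toList)} are true, while
\texttt{dfa.accepts("ab".toList)} is false. This matches what you get
by following the transitions in \eqref{subsetdfa} by hand.\medskip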
7a0182c66403 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	1077
7a0182c66403 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	1078	\noindent
You might be able to spend some quality time tinkering with this code
and pondering about it. Then you will probably notice it is actually
a bit silly. The whole point of translating the NFA into a DFA via the
subset construction is to make the decision of whether a string is
accepted or not faster. Given the code above, the generated DFA will
be exactly as fast, or as slow, as the NFA we started with (actually
it will even be a tiny bit slower). The reason is that we just re-use
the \texttt{nexts} function from the NFA. This function implements the
non-deterministic breadth-first search. You might be thinking: This
is cheating! \ldots{} Well, not quite as you will see later, but in
terms of speed we still need to work a bit in order to get
sometimes(!) a faster DFA. Let's do this next.
490 8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1091
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1092	\subsection*{DFA Minimisation}
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1093
As seen in \eqref{subsetdfa}, the subset construction from an NFA to a
DFA can result in a rather ``inefficient'' DFA, meaning there are
states that are not needed. There are two kinds of such unneeded
states: \emph{unreachable} states and \emph{non-distinguishable}
states. The first kind of states can just be removed without affecting
the language that can be recognised (after all they are
unreachable). The second kind can also be detected and merged; thus a
DFA can be \emph{minimised} by the following algorithm:
490 8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1102
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1103	\begin{enumerate}
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1104	\item Take all pairs $(q, p)$ with $q \not= p$
\item Mark all pairs that consist of an accepting and a non-accepting state
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1106	\item For all unmarked pairs $(q, p)$ and all characters $c$
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1107	test whether
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1108
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1109	\begin{center}
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1110	$(\delta(q, c), \delta(p,c))$
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1111	\end{center}
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1112
is marked. If this is the case for some $c$, then also mark $(q, p)$.
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1114	\item Repeat last step until no change.
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1115	\item All unmarked pairs can be merged.
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1116	\end{enumerate}
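
\noindent
The five steps above can be sketched in Scala as follows. This is
only a minimal illustration, not a tuned implementation: it assumes a
total transition function, represents each pair $(q, p)$ as a
two-element set, and uses the made-up names \texttt{unmarkedPairs},
\texttt{delta4} and \texttt{mergeable}. The four-state DFA at the end
is a hypothetical example chosen for this sketch.

{\small\begin{lstlisting}[language=Scala]
import scala.annotation.tailrec

// returns the pairs of states that stay unmarked
def unmarkedPairs[A, C](states: List[A], alphabet: Set[C],
                        delta: (A, C) => A,
                        fins: A => Boolean): Set[Set[A]] = {
  // Step 1: all pairs (q, p) with q != p
  val pairs: Set[Set[A]] =
    (for (q <- states; p <- states; if q != p)
       yield Set(q, p)).toSet
  // Step 2: mark pairs of an accepting and a non-accepting state
  val initial =
    pairs.filter(pr => pr.exists(fins) && !pr.forall(fins))

  // Steps 3 and 4: mark pairs that have a marked successor
  // pair for some character; repeat until nothing changes
  @tailrec
  def loop(marked: Set[Set[A]]): Set[Set[A]] = {
    val more = (pairs -- marked).filter { pr =>
      alphabet.exists(c =>
        marked.contains(pr.map(q => delta(q, c))))
    }
    if (more.isEmpty) marked else loop(marked ++ more)
  }

  // Step 5: the unmarked pairs can be merged
  pairs -- loop(initial)
}

// hypothetical DFA: 0 is the start, 3 the only accepting state
val delta4: Map[(Int, Char), Int] = Map(
  (0, 'a') -> 1, (0, 'b') -> 2,
  (1, 'a') -> 3, (1, 'b') -> 3,
  (2, 'a') -> 3, (2, 'b') -> 3,
  (3, 'a') -> 3, (3, 'b') -> 3
)

val mergeable = unmarkedPairs[Int, Char](
  List(0, 1, 2, 3), Set('a', 'b'),
  (q, c) => delta4((q, c)), _ == 3)
\end{lstlisting}}

\noindent
In this small DFA the states $1$ and $2$ behave identically, and
indeed \texttt{mergeable} is \texttt{Set(Set(1, 2))}.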
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1117
491 7a0182c66403 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	1118	\noindent Unfortunately, once we throw away all unreachable states in
7a0182c66403 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	1119	\eqref{subsetdfa}, all remaining states are needed. In order to
7a0182c66403 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	1120	illustrate the minimisation algorithm, consider the following DFA.
344 408fd5994288 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 333 diff changeset	1121
490 8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1122	\begin{center}
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1123	\begin{tikzpicture}[>=stealth',very thick,auto,
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1124	every state/.style={minimum size=0pt,
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1125	inner sep=2pt,draw=blue!50,very thick,
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1126	fill=blue!20}]
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1127	\node[state,initial] (Q_0) {$Q_0$};
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1128	\node[state] (Q_1) [right=of Q_0] {$Q_1$};
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1129	\node[state] (Q_2) [below right=of Q_0] {$Q_2$};
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1130	\node[state] (Q_3) [right=of Q_2] {$Q_3$};
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1131	\node[state, accepting] (Q_4) [right=of Q_1] {$Q_4$};
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1132	\path[->] (Q_0) edge node [above] {$a$} (Q_1);
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1133	\path[->] (Q_1) edge node [above] {$a$} (Q_4);
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1134	\path[->] (Q_4) edge [loop right] node {$a, b$} ();
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1135	\path[->] (Q_3) edge node [right] {$a$} (Q_4);
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1136	\path[->] (Q_2) edge node [above] {$a$} (Q_3);
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1137	\path[->] (Q_1) edge node [right] {$b$} (Q_2);
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1138	\path[->] (Q_0) edge node [above] {$b$} (Q_2);
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1139	\path[->] (Q_2) edge [loop left] node {$b$} ();
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1140	\path[->] (Q_3) edge [bend left=95, looseness=1.3] node
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1141	[below] {$b$} (Q_0);
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1142	\end{tikzpicture}
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1143	\end{center}
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1144
\noindent In Steps 1 and 2 we consider essentially a triangle
of the form
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1147
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1148	\begin{center}
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1149	\begin{tikzpicture}[scale=0.6,line width=0.8mm]
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1150	\draw (0,0) -- (4,0);
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1151	\draw (0,1) -- (4,1);
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1152	\draw (0,2) -- (3,2);
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1153	\draw (0,3) -- (2,3);
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1154	\draw (0,4) -- (1,4);
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1155
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1156	\draw (0,0) -- (0, 4);
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1157	\draw (1,0) -- (1, 4);
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1158	\draw (2,0) -- (2, 3);
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1159	\draw (3,0) -- (3, 2);
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1160	\draw (4,0) -- (4, 1);
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1161
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1162	\draw (0.5,-0.5) node {$Q_0$};
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1163	\draw (1.5,-0.5) node {$Q_1$};
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1164	\draw (2.5,-0.5) node {$Q_2$};
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1165	\draw (3.5,-0.5) node {$Q_3$};
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1166
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1167	\draw (-0.5, 3.5) node {$Q_1$};
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1168	\draw (-0.5, 2.5) node {$Q_2$};
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1169	\draw (-0.5, 1.5) node {$Q_3$};
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1170	\draw (-0.5, 0.5) node {$Q_4$};
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1171
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1172	\draw (0.5,0.5) node {\large$\star$};
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1173	\draw (1.5,0.5) node {\large$\star$};
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1174	\draw (2.5,0.5) node {\large$\star$};
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1175	\draw (3.5,0.5) node {\large$\star$};
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1176	\end{tikzpicture}
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1177	\end{center}
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1178
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1179	\noindent where the lower row is filled with stars, because in
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1180	the corresponding pairs there is always one state that is
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1181	accepting ($Q_4$) and a state that is non-accepting (the other
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1182	states).
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1183
In Step 3 we need to fill in more stars according to whether
one of the next-state pairs is marked. We have to do this
for every unmarked field until there is no change anymore.
This gives the triangle
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1188
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1189	\begin{center}
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1190	\begin{tikzpicture}[scale=0.6,line width=0.8mm]
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1191	\draw (0,0) -- (4,0);
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1192	\draw (0,1) -- (4,1);
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1193	\draw (0,2) -- (3,2);
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1194	\draw (0,3) -- (2,3);
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1195	\draw (0,4) -- (1,4);
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1196
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1197	\draw (0,0) -- (0, 4);
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1198	\draw (1,0) -- (1, 4);
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1199	\draw (2,0) -- (2, 3);
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1200	\draw (3,0) -- (3, 2);
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1201	\draw (4,0) -- (4, 1);
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1202
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1203	\draw (0.5,-0.5) node {$Q_0$};
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1204	\draw (1.5,-0.5) node {$Q_1$};
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1205	\draw (2.5,-0.5) node {$Q_2$};
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1206	\draw (3.5,-0.5) node {$Q_3$};
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1207
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1208	\draw (-0.5, 3.5) node {$Q_1$};
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1209	\draw (-0.5, 2.5) node {$Q_2$};
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1210	\draw (-0.5, 1.5) node {$Q_3$};
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1211	\draw (-0.5, 0.5) node {$Q_4$};
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1212
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1213	\draw (0.5,0.5) node {\large$\star$};
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1214	\draw (1.5,0.5) node {\large$\star$};
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1215	\draw (2.5,0.5) node {\large$\star$};
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1216	\draw (3.5,0.5) node {\large$\star$};
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1217	\draw (0.5,1.5) node {\large$\star$};
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1218	\draw (2.5,1.5) node {\large$\star$};
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1219	\draw (0.5,3.5) node {\large$\star$};
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1220	\draw (1.5,2.5) node {\large$\star$};
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1221	\end{tikzpicture}
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1222	\end{center}
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1223
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1224	\noindent which means states $Q_0$ and $Q_2$, as well as $Q_1$
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1225	and $Q_3$ can be merged. This gives the following minimal DFA
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1226
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1227	\begin{center}
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1228	\begin{tikzpicture}[>=stealth',very thick,auto,
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1229	every state/.style={minimum size=0pt,
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1230	inner sep=2pt,draw=blue!50,very thick,
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1231	fill=blue!20}]
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1232	\node[state,initial] (Q_02) {$Q_{0, 2}$};
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1233	\node[state] (Q_13) [right=of Q_02] {$Q_{1, 3}$};
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1234	\node[state, accepting] (Q_4) [right=of Q_13]
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1235	{$Q_{4\phantom{,0}}$};
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1236	\path[->] (Q_02) edge [bend left] node [above] {$a$} (Q_13);
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1237	\path[->] (Q_13) edge [bend left] node [below] {$b$} (Q_02);
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1238	\path[->] (Q_02) edge [loop below] node {$b$} ();
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1239	\path[->] (Q_13) edge node [above] {$a$} (Q_4);
8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1240	\path[->] (Q_4) edge [loop above] node {$a, b$} ();
344 408fd5994288 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 333 diff changeset	1241	\end{tikzpicture}
408fd5994288 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 333 diff changeset	1242	\end{center}
408fd5994288 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 333 diff changeset	1243
408fd5994288 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 333 diff changeset	1244
492 882d5de18adc updated Christian Urban <urbanc@in.tum.de> parents: 491 diff changeset	1245	By the way, we are not bothering with implementing the above
882d5de18adc updated Christian Urban <urbanc@in.tum.de> parents: 491 diff changeset	1246	minimisation algorithm: while up to now all the transformations used
882d5de18adc updated Christian Urban <urbanc@in.tum.de> parents: 491 diff changeset	1247	some clever composition of functions, the minimisation algorithm
882d5de18adc updated Christian Urban <urbanc@in.tum.de> parents: 491 diff changeset	1248	cannot be implemented by just composing some functions. For this we
882d5de18adc updated Christian Urban <urbanc@in.tum.de> parents: 491 diff changeset	1249	would require a more concrete representation of the transition
882d5de18adc updated Christian Urban <urbanc@in.tum.de> parents: 491 diff changeset	1250	function (like maps). If we did this, however, then many advantages of
882d5de18adc updated Christian Urban <urbanc@in.tum.de> parents: 491 diff changeset	1251	the functions would be thrown away. So the compromise is that we are not
882d5de18adc updated Christian Urban <urbanc@in.tum.de> parents: 491 diff changeset	1252	able to minimise our DFAs (easily).
882d5de18adc updated Christian Urban <urbanc@in.tum.de> parents: 491 diff changeset	1253
490 8a07f7256f2a updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1254	\subsection*{Brzozowski's Method}
269 83e6cb90216d updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 268 diff changeset	1255
491 7a0182c66403 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	1256	I know this is already a long, long rant: but after all it is a topic
7a0182c66403 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	1257	that has been researched for more than 60 years. If you reflect on
492 882d5de18adc updated Christian Urban <urbanc@in.tum.de> parents: 491 diff changeset	1258	what you have read so far, the story is that you can take a regular
491 7a0182c66403 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	1259	expression, translate it via the Thompson Construction into an
7a0182c66403 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	1260	$\epsilon$NFA, then translate it into an NFA by removing all
7a0182c66403 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	1261	$\epsilon$-transitions, and then via the subset construction obtain a
7a0182c66403 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	1262	DFA. In all steps we made sure the language, or which strings can be
7a0182c66403 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	1263	recognised, stays the same. After the last section, we can even
492 882d5de18adc updated Christian Urban <urbanc@in.tum.de> parents: 491 diff changeset	1264	minimise the DFA (maybe not in code). But again we made sure the same
882d5de18adc updated Christian Urban <urbanc@in.tum.de> parents: 491 diff changeset	1265	language is recognised. You might be wondering: Can we go in the
882d5de18adc updated Christian Urban <urbanc@in.tum.de> parents: 491 diff changeset	1266	other direction? Can we go from a DFA and obtain a regular expression
882d5de18adc updated Christian Urban <urbanc@in.tum.de> parents: 491 diff changeset	1267	that can recognise the same language as the DFA?\medskip
491 7a0182c66403 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	1268
7a0182c66403 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	1269	\noindent
7a0182c66403 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	1270	The answer is yes. Again there are several methods for calculating a
7a0182c66403 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	1271	regular expression for a DFA. I will show you Brzozowski's method
7a0182c66403 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	1272	because it calculates a regular expression using quite familiar
7a0182c66403 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	1273	transformations for solving equational systems. Consider the DFA:
292 7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1274
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1275	\begin{center}
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1276	\begin{tikzpicture}[scale=1.5,>=stealth',very thick,auto,
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1277	every state/.style={minimum size=0pt,
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1278	inner sep=2pt,draw=blue!50,very thick,
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1279	fill=blue!20}]
482 74149519e436 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	1280	\node[state, initial] (q0) at ( 0,1) {$Q_0$};
74149519e436 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	1281	\node[state] (q1) at ( 1,1) {$Q_1$};
74149519e436 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	1282	\node[state, accepting] (q2) at ( 2,1) {$Q_2$};
292 7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1283	\path[->] (q0) edge[bend left] node[above] {$a$} (q1)
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1284	(q1) edge[bend left] node[above] {$b$} (q0)
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1285	(q2) edge[bend left=50] node[below] {$b$} (q0)
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1286	(q1) edge node[above] {$a$} (q2)
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1287	(q2) edge [loop right] node {$a$} ()
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1288	(q0) edge [loop below] node {$b$} ();
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1289	\end{tikzpicture}
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1290	\end{center}
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1291
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1292	\noindent for which we can set up the following equational
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1293	system
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1294
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1295	\begin{eqnarray}
482 74149519e436 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	1296	Q_0 & = & \ONE + Q_0\,b + Q_1\,b + Q_2\,b\\
74149519e436 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	1297	Q_1 & = & Q_0\,a\\
74149519e436 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	1298	Q_2 & = & Q_1\,a + Q_2\,a
292 7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1299	\end{eqnarray}
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1300
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1301	\noindent There is an equation for each node in the DFA. Let
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1302	us have a look at how the right-hand sides of the equations are
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1303	constructed. First have a look at the second equation: the
482 74149519e436 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	1304	left-hand side is $Q_1$ and the right-hand side $Q_0\,a$. The
292 7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1305	right-hand side is essentially all possible ways to end up
482 74149519e436 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	1306	in node $Q_1$. There is only one incoming edge from $Q_0$ consuming
322 698ed1c96cd0 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 318 diff changeset	1307	an $a$. Therefore the right-hand side is this
482 74149519e436 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	1308	state followed by the character---in this case $Q_0\,a$. Now let us
292 7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1309	have a look at the third equation: there are two incoming
482 74149519e436 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	1310	edges for $Q_2$. Therefore we have two terms, namely $Q_1\,a$ and
74149519e436 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	1311	$Q_2\,a$. These terms are separated by $+$. The first states
74149519e436 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	1312	that being in $Q_1$ and consuming an $a$ will bring you to
485 21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	1313	$Q_2$, and the second that being in $Q_2$ and consuming an $a$
482 74149519e436 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	1314	will make you stay in $Q_2$. The right-hand side of the
292 7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1315	first equation is constructed similarly: there are three
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1316	incoming edges, therefore there are three terms. There is
444 3056a4c071b0 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 349 diff changeset	1317	one exception in that we also ``add'' $\ONE$ to the
292 7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1318	first equation, because it corresponds to the starting state
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1319	in the DFA.
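
In general, the construction can be sketched as follows, where
$P \stackrel{c}{\longrightarrow} Q$ is notation introduced here only for
illustration: it stands for an edge from state $P$ to state $Q$ labelled with
the character $c$. The equation for a state $Q$ then has the shape

\[
Q \;=\; \sum_{P \stackrel{c}{\longrightarrow} Q} P\,c
\qquad\mbox{(plus an extra summand $\ONE$ if $Q$ is the starting state)}
\]

\noindent where the sum stands for the individual terms separated by $+$.
For $Q_2$ in the example, the edges $Q_1 \stackrel{a}{\longrightarrow} Q_2$
and $Q_2 \stackrel{a}{\longrightarrow} Q_2$ give exactly the two terms
$Q_1\,a + Q_2\,a$.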
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1320
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1321	Having constructed the equational system, the question is
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1322	how to solve it. Remarkably, the rules are very similar to
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1323	solving usual linear equational systems. For example the
482 74149519e436 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	1324	second equation does not contain the variable $Q_1$ on the
292 7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1325	right-hand side of the equation. We can therefore eliminate
482 74149519e436 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	1326	$Q_1$ from the system by just substituting this equation
292 7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1327	into the other two. This gives
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1328
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1329	\begin{eqnarray}
482 74149519e436 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	1330	Q_0 & = & \ONE + Q_0\,b + Q_0\,a\,b + Q_2\,b\\
74149519e436 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	1331	Q_2 & = & Q_0\,a\,a + Q_2\,a
292 7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1332	\end{eqnarray}
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1333
485 21dec9df46ba updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	1334	\noindent where in Equation (4) we have two occurrences
482 74149519e436 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	1335	of $Q_0$. Using the usual laws about $+$ and $\cdot$, we can simplify
292 7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1336	Equation (4) to obtain the following two equations:
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1337
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1338	\begin{eqnarray}
482 74149519e436 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	1339	Q_0 & = & \ONE + Q_0\,(b + a\,b) + Q_2\,b\\
74149519e436 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	1340	Q_2 & = & Q_0\,a\,a + Q_2\,a
292 7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1341	\end{eqnarray}
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1342
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1343	\noindent Unfortunately we cannot make any more progress with
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1344	substituting equations, because both (6) and (7) contain the
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1345	variable on the left-hand side also on the right-hand side.
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1346	Here we need to now use a law that is different from the usual
349 434891622131 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 344 diff changeset	1347	laws about linear equations. It is called \emph{Arden's rule}.
434891622131 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 344 diff changeset	1348	It states that if an equation is of the form $q = q\,r + s$
434891622131 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 344 diff changeset	1349	then it can be transformed to $q = s\, r^*$. Since we can
434891622131 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 344 diff changeset	1350	assume $+$ is commutative, Equation (7) is of that form: $s$ is
482 74149519e436 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	1351	$Q_0\,a\,a$ and $r$ is $a$. That means we can transform
349 434891622131 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 344 diff changeset	1352	(7) to obtain the two new equations
292 7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1353
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1354	\begin{eqnarray}
482 74149519e436 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	1355	Q_0 & = & \ONE + Q_0\,(b + a\,b) + Q_2\,b\\
74149519e436 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	1356	Q_2 & = & Q_0\,a\,a\,(a^*)
292 7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1357	\end{eqnarray}
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1358
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1359	\noindent Now again we can substitute the second equation into
482 74149519e436 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	1360	the first in order to eliminate the variable $Q_2$.
292 7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1361
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1362	\begin{eqnarray}
482 74149519e436 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	1363	Q_0 & = & \ONE + Q_0\,(b + a\,b) + Q_0\,a\,a\,(a^*)\,b
292 7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1364	\end{eqnarray}
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1365
482 74149519e436 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	1366	\noindent Pulling $Q_0$ out as a single factor gives:
292 7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1367
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1368	\begin{eqnarray}
482 74149519e436 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	1369	Q_0 & = & \ONE + Q_0\,(b + a\,b + a\,a\,(a^*)\,b)
292 7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1370	\end{eqnarray}
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1371
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1372	\noindent This equation is again of a form where we can
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1373	apply Arden's rule ($r$ is $b + a\,b + a\,a\,(a^*)\,b$ and $s$
482 74149519e436 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	1374	is $\ONE$). This gives as solution for $Q_0$ the following
292 7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1375	regular expression:
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1376
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1377	\begin{eqnarray}
482 74149519e436 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	1378	Q_0 & = & \ONE\,(b + a\,b + a\,a\,(a^*)\,b)^*
292 7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1379	\end{eqnarray}
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1380
349 434891622131 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 344 diff changeset	1381	\noindent Since this is a regular expression, we can simplify
444 3056a4c071b0 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 349 diff changeset	1382	away the $\ONE$ to obtain the slightly simpler regular
292 7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1383	expression
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1384
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1385	\begin{eqnarray}
482 74149519e436 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	1386	Q_0 & = & (b + a\,b + a\,a\,(a^*)\,b)^*
292 7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1387	\end{eqnarray}
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1388
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1389	\noindent
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1390	Now we can unwind this process and obtain the solutions
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1391	for the other equations. This gives:
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1392
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1393	\begin{eqnarray}
482 74149519e436 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	1394	Q_0 & = & (b + a\,b + a\,a\,(a^*)\,b)^*\\
74149519e436 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	1395	Q_1 & = & (b + a\,b + a\,a\,(a^*)\,b)^*\,a\\
74149519e436 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	1396	Q_2 & = & (b + a\,b + a\,a\,(a^*)\,b)^*\,a\,a\,(a)^*
292 7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1397	\end{eqnarray}
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1398
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1399	\noindent Finally, we only need to ``add'' up the equations
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1400	which correspond to an accepting state. In our running example,
482 74149519e436 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	1401	this is just $Q_2$. Consequently, a regular expression
491 7a0182c66403 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	1402	that recognises the same language as the DFA is
292 7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1403
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1404	\[
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1405	(b + a\,b + a\,a\,(a^*)\,b)^*\,a\,a\,(a)^*
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1406	\]
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1407
491 7a0182c66403 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	1408	\noindent You can somewhat crosscheck your solution by taking a string
7a0182c66403 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	1409	the regular expression can match and see whether it can be matched
7a0182c66403 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	1410	by the DFA. One string for example is $aaa$ and \emph{voila} this
292 7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1411	string is also matched by the automaton.
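
Spelled out, the crosscheck for $aaa$ looks as follows. In the DFA the
string is consumed along the path

\[
Q_0 \stackrel{a}{\longrightarrow} Q_1 \stackrel{a}{\longrightarrow} Q_2
\stackrel{a}{\longrightarrow} Q_2
\]

\noindent and $Q_2$ is the accepting state. On the side of the regular
expression, the starred part $(b + a\,b + a\,a\,(a^*)\,b)^*$ matches the
empty string, $a\,a$ matches the first two characters, and the final $(a)^*$
matches the remaining $a$. Of course, a single string only gives some
evidence, not a proof, that the two agree.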
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1412
491 7a0182c66403 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	1413	We should prove that Brzozowski's method really produces an equivalent
7a0182c66403 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	1414	regular expression. But for the purposes of this module, we omit
7a0182c66403 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	1415	this. I guess you are relieved.
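
Even without the full proof, one can at least check that the regular
expression produced by Arden's rule solves its equation. This is only a
sketch, and it assumes the standard law $r^* \equiv \ONE + r^*\,r$ for the
star. Substituting $s\,r^*$ for $q$ on the right-hand side of
$q = q\,r + s$ gives

\[
s\,r^*\,r + s \;\equiv\; s\,(r^*\,r + \ONE) \;\equiv\; s\,r^*
\]

\noindent which is the left-hand side again. Note that this shows only that
$s\,r^*$ is \emph{a} solution of the equation, not that it is the only one.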
269 83e6cb90216d updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 268 diff changeset	1416
83e6cb90216d updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 268 diff changeset	1417
\subsection*{Regular Languages}

Given the constructions in the previous sections we obtain
the following overall picture:

\begin{center}
\begin{tikzpicture}
\node (rexp) {\bf Regexps};
\node (nfa) [right=of rexp] {\bf NFAs};
\node (dfa) [right=of nfa] {\bf DFAs};
\node (mdfa) [right=of dfa] {\bf\begin{tabular}{c}minimal\\ DFAs\end{tabular}};
\path[->,line width=1mm] (rexp) edge node [above=4mm, black] {\begin{tabular}{c@{\hspace{9mm}}}Thompson's\\[-1mm] construction\end{tabular}} (nfa);
\path[->,line width=1mm] (nfa) edge node [above=4mm, black] {\begin{tabular}{c}subset\\[-1mm] construction\end{tabular}}(dfa);
\path[->,line width=1mm] (dfa) edge node [below=5mm, black] {minimisation} (mdfa);
\path[->,line width=1mm] (dfa) edge [bend left=45] node [below] {\begin{tabular}{l}Brzozowski's\\ method\end{tabular}} (rexp);
\end{tikzpicture}
\end{center}

\noindent By going from regular expressions via NFAs to DFAs, we can
always ensure that for every regular expression there exists an NFA
and a DFA that recognise the same language, although we did not
prove this fact. Similarly, by going from DFAs to regular expressions,
we can make sure that for every DFA there exists a regular expression
that recognises the same language. Again, we did not prove this fact.

The fundamental conclusion we can draw is that automata and regular
expressions can recognise the same set of languages:

\begin{quote} A language is \emph{regular} iff there exists a
regular expression that recognises all its strings (and no others).
\end{quote}

\noindent or equivalently

\begin{quote} A language is \emph{regular} iff there exists an
automaton that recognises all its strings (and no others).
\end{quote}

\noindent Note that this is not a statement about a particular language
(that is, a particular set of strings), but about a large class of
languages, namely the regular ones.

As a consequence, for deciding whether a string is recognised by a
regular expression, we could use our algorithm based on derivatives,
or NFAs, or DFAs. But let us quickly look at what the differences mean
in computational terms. Translating a regular expression into an NFA
gives us an automaton that has $O(n)$ states, that is, the size of the
NFA grows linearly with the size of the regular expression. The
problem with NFAs is that deciding whether a string is accepted or not
is computationally not cheap. Remember that with NFAs we have
potentially many next states even for the same input, and we also have
the silent $\epsilon$-transitions. If we want to find a path from the
starting state of an NFA to an accepting state, we need to consider
all possibilities. In Ruby, Python and Java this is done by a
depth-first search, which in turn means that if a ``wrong'' choice is
made, the algorithm has to backtrack and thus potentially explore all
candidates. This is exactly the reason why Ruby, Python and Java are
so slow for evil regular expressions. An alternative to the
potentially slow depth-first search is to explore the search space in
a breadth-first fashion, but this might incur a big memory penalty.

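To make the two search strategies concrete, here is a minimal sketch in
Python. The encoding is an assumption made only for this illustration
and is not the one used elsewhere in this handout: states are integers,
the transitions are a dictionary from (state, symbol) pairs to sets of
next states, and the symbol \texttt{None} stands for a silent
$\epsilon$-transition. The function names are hypothetical as well.

```python
def eps_closure(delta, states):
    """All states reachable from `states` using only epsilon-transitions."""
    todo, seen = list(states), set(states)
    while todo:
        q = todo.pop()
        for p in delta.get((q, None), set()):
            if p not in seen:
                seen.add(p)
                todo.append(p)
    return seen


def accepts_bfs(delta, start, accepting, s):
    """Breadth-first: keep the set of all states the NFA could be in."""
    current = eps_closure(delta, {start})
    for c in s:
        step = set()
        for q in current:
            step |= delta.get((q, c), set())
        current = eps_closure(delta, step)
    return bool(current & accepting)


def accepts_dfs(delta, start, accepting, s):
    """Depth-first: commit to one choice and backtrack if it fails."""
    def go(q, i, eps_seen):
        if i == len(s) and q in accepting:
            return True
        # silent moves stay at position i; eps_seen avoids epsilon-cycles
        for p in delta.get((q, None), set()):
            if p not in eps_seen and go(p, i, eps_seen | {p}):
                return True
        # consuming moves advance to position i + 1
        if i < len(s):
            for p in delta.get((q, s[i]), set()):
                if go(p, i + 1, frozenset({p})):
                    return True
        return False
    return go(start, 0, frozenset({start}))


# Example NFA for (a + b)* . a . (a + b): the second-to-last symbol is an a
delta = {(0, 'a'): {0, 1}, (0, 'b'): {0}, (1, 'a'): {2}, (1, 'b'): {2}}
print(accepts_bfs(delta, 0, {2}, "bab"), accepts_dfs(delta, 0, {2}, "bab"))
```

The breadth-first version never backtracks, but it has to carry a whole
set of states along, which is where the memory cost comes from. The
depth-first version only remembers one path at a time, but may revisit
the same state at the same input position many times.
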
To avoid the problems with NFAs, we can translate them into DFAs. With
DFAs the problem of deciding whether a string is recognised or not is
much simpler, because in each state it is completely determined what
the next state will be for a given input. So no search is needed. The
problem with this is that the translation to DFAs can blow up the
number of states exponentially. Therefore when this route is taken, we
definitely need to minimise the resulting DFAs in order to have an
acceptable memory and runtime behaviour. But remember that the subset
construction can, in the worst case, produce $2^n$ states for an NFA
with $n$ states. Effectively, the translation to DFAs can therefore
also incur a big runtime penalty.

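The exponential blow-up can be observed with a small experiment. The
following Python sketch assumes an $\epsilon$-free NFA encoded as a
dictionary from (state, symbol) pairs to sets of states; this encoding
and all function names are assumptions for this illustration only. The
example family is the standard one where the $n$-th symbol from the end
must be an $a$: the NFA needs $n + 1$ states, while the subset
construction reaches $2^n$ DFA states.

```python
def nth_from_end_nfa(n):
    """NFA with n + 1 states: the n-th symbol from the end is an a."""
    delta = {(0, 'a'): {0, 1}, (0, 'b'): {0}}
    for q in range(1, n):
        delta[(q, 'a')] = {q + 1}
        delta[(q, 'b')] = {q + 1}
    return delta, 0, {n}


def subset_construction(delta, start, accepting, alphabet):
    """Reachable part of the subset construction for an epsilon-free NFA."""
    start_set = frozenset({start})
    dfa_delta, todo, seen = {}, [start_set], {start_set}
    while todo:
        S = todo.pop()
        for c in alphabet:
            # the DFA successor is the union of all NFA successors
            T = frozenset(p for q in S for p in delta.get((q, c), set()))
            dfa_delta[(S, c)] = T
            if T not in seen:
                seen.add(T)
                todo.append(T)
    dfa_accepting = {S for S in seen if S & accepting}
    return dfa_delta, start_set, dfa_accepting, seen


def dfa_accepts(dfa_delta, start, accepting, s):
    """Running a DFA needs no search: one table lookup per symbol."""
    q = start
    for c in s:
        q = dfa_delta[(q, c)]
    return q in accepting


for n in range(1, 7):
    nfa_delta, q0, fins = nth_from_end_nfa(n)
    _, _, _, dfa_states = subset_construction(nfa_delta, q0, fins, "ab")
    print(n + 1, "NFA states ->", len(dfa_states), "DFA states")
```

Note that minimisation does not help for this family: the $2^n$ states
are all needed, because the DFA has to remember the last $n$ symbols.
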
But this does not mean that everything is bad with automata. Recall
the problem of finding a regular expression for the language that is
\emph{not} recognised by a given regular expression. In our
implementation we explicitly added such a regular expression because
it is useful for recognising comments. But in principle we did not
need to. The argument for this is as follows: take a regular
expression, translate it into an NFA and then a DFA that both
recognise the same language. Once you have the DFA it is very easy to
construct the automaton for the language not recognised by this DFA.
If the DFA is completed (this is important!), then you just need to
exchange the accepting and non-accepting states. You can then
translate this DFA back into a regular expression and that will be the
regular expression that can match all strings the original regular
expression could \emph{not} match.

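The following Python sketch illustrates why the completion step
matters. It assumes a possibly partial DFA given as a dictionary from
(state, symbol) pairs to a single next state; the state name
\texttt{"sink"} and the function names are hypothetical and chosen only
for this example.

```python
def complete(delta, states, alphabet):
    """Make the DFA total by sending missing transitions to a sink state."""
    sink = "sink"  # hypothetical fresh name, assumed not to clash
    total = dict(delta)
    for q in list(states) + [sink]:
        for c in alphabet:
            total.setdefault((q, c), sink)
    return total, set(states) | {sink}


def complement(delta, states, start, accepting, alphabet):
    """Complete the DFA, then swap accepting and non-accepting states."""
    total, all_states = complete(delta, states, alphabet)
    return total, all_states, start, all_states - set(accepting)


def run(delta, start, accepting, s):
    """Run a total DFA on the string s."""
    q = start
    for c in s:
        q = delta[(q, c)]
    return q in accepting


# Partial DFA for a . b* over the alphabet {a, b}
delta = {(0, 'a'): 1, (1, 'b'): 1}
c_delta, c_states, c_start, c_acc = complement(delta, {0, 1}, 0, {1}, "ab")
print(run(c_delta, c_start, c_acc, "b"))   # b is not matched by a . b*
```

If we swapped the accepting states of the partial DFA directly, only
state 0 would be accepting and the strings $b$ and $aa$ would be lost:
they get stuck on a missing transition instead of ending in an
accepting state. After completion they end in the sink state, which
becomes accepting in the complement. The sketch always adds the sink
state; if the DFA is already complete, the sink state is simply
unreachable.
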
It is also interesting that not all languages are regular. The most
well-known example of a language that is not regular consists of all
the strings of the form

\[a^n\,b^n\]

\noindent meaning strings that consist of some number of $a$s followed
by the same number of $b$s. You can try, but you cannot find a regular
expression for this language and also not an automaton. One can
actually prove that there is neither a regular expression nor an
automaton for this language, but again that would lead us too far
afield for what we want to do in this module.


\subsection*{Where Have Derivatives Gone?}

By now you are probably fed up with this text. It is now way too long
for one lecture, but there is still one aspect of the
automata-connection I would like to highlight for you. Perhaps by now
you are asking yourself: Where have the derivatives gone? Did we just
forget them? Well, they have a place in the picture of calculating a
DFA from the regular expression.

To be done


%\section*{Further Reading}

%Compare what a ``human expert'' would create as an automaton for the
%regular expression $a\cdot (b + c)^*$ and what the Thompson
%algorithm generates.

%http://www.inf.ed.ac.uk/teaching/courses/ct/
\end{document}

%%% Local Variables:
%%% mode: latex
%%% TeX-master: t
%%% End:


