% !TEX program = xelatex
\documentclass{article}
\usepackage{../style}
\usepackage{../langs}
\usepackage{../graphics}


\begin{document}
\fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017, 2020, 2022}

\section*{Handout 3 (Finite Automata)}


Every formal language and compiler course I know of bombards you first
with automata and then, to a much, much smaller extent, with regular
expressions. As you can see, this course is turned upside down:
regular expressions come first. The reason is that regular expressions
are easier to reason about and the notion of derivatives, although
already quite old, only became more widely known rather
recently. Still, let us in this lecture have a closer look at automata
and their relation to regular expressions. This will help us with
understanding why the regular expression matchers in Python, Ruby and
Java are so slow with certain regular expressions. On the way we will
also see what the limitations of regular expressions
are. Unfortunately, they cannot be used for \emph{everything}.


\subsection*{Deterministic Finite Automata}

Let's start\ldots the central definition is:\medskip

\noindent
A \emph{deterministic finite automaton} (DFA), say $A$, is
given by a five-tuple written ${\cal A}(\varSigma, Qs, Q_0, F, \delta)$ where

\begin{itemize}
\item $\varSigma$ is an alphabet,
\item $Qs$ is a finite set of states,
\item $Q_0 \in Qs$ is the start state,
\item $F \subseteq Qs$ are the accepting states, and
\item $\delta$ is the transition function.
\end{itemize}

\noindent I am sure you have seen this definition
before. The transition function determines how to ``transition'' from
one state to the next state with respect to a character. We assume
that transition functions do not need to be defined
everywhere: it can be the case that, given a state and a character,
there is no next state, in which case we need to raise a kind of ``failure
exception''. That means we actually have \emph{partial} functions as
transitions (see the Scala implementation for DFAs later on). A
typical example of a DFA is

\begin{center}
\begin{tikzpicture}[>=stealth',very thick,auto,
                    every state/.style={minimum size=0pt,
                    inner sep=2pt,draw=blue!50,very thick,
                    fill=blue!20},scale=2]
\node[state,initial]  (Q_0)  {$Q_0$};
\node[state] (Q_1) [right=of Q_0] {$Q_1$};
\node[state] (Q_2) [below right=of Q_0] {$Q_2$};
\node[state] (Q_3) [right=of Q_2] {$Q_3$};
\node[state, accepting] (Q_4) [right=of Q_1] {$Q_4$};
\path[->] (Q_0) edge node [above]  {$a$} (Q_1);
\path[->] (Q_1) edge node [above]  {$a$} (Q_4);
\path[->] (Q_4) edge [loop right] node  {$a, b$} ();
\path[->] (Q_3) edge node [right]  {$a$} (Q_4);
\path[->] (Q_2) edge node [above]  {$a$} (Q_3);
\path[->] (Q_1) edge node [right]  {$b$} (Q_2);
\path[->] (Q_0) edge node [above]  {$b$} (Q_2);
\path[->] (Q_2) edge [loop left] node  {$b$} ();
\path[->] (Q_3) edge [bend left=95, looseness=1.3] node [below]  {$b$} (Q_0);
\end{tikzpicture}
\end{center}

\noindent In this graphical notation, the accepting state $Q_4$ is
indicated with double circles. Note that there can be more than one
accepting state. It is also possible that a DFA has no accepting
state at all, or that the starting state is also an accepting
state. In the case above the transition function is defined everywhere
and can also be given as a table as follows:

\[
\begin{array}{lcl}
(Q_0, a) &\rightarrow& Q_1\\
(Q_0, b) &\rightarrow& Q_2\\
(Q_1, a) &\rightarrow& Q_4\\
(Q_1, b) &\rightarrow& Q_2\\
(Q_2, a) &\rightarrow& Q_3\\
(Q_2, b) &\rightarrow& Q_2\\
(Q_3, a) &\rightarrow& Q_4\\
(Q_3, b) &\rightarrow& Q_0\\
(Q_4, a) &\rightarrow& Q_4\\
(Q_4, b) &\rightarrow& Q_4\\
\end{array}
\]

\noindent
Please check that this table represents the same transition function
as the graph above.

We need to define the notion of what language is accepted by
an automaton. For this we lift the transition function
$\delta$ from characters to strings as follows:

\[
\begin{array}{lcl}
\widehat{\delta}(q, [])        & \dn & q\\
\widehat{\delta}(q, c\!::\!s) & \dn & \widehat{\delta}(\delta(q, c), s)\\
\end{array}
\]

\noindent This lifted transition function is often called
\emph{delta-hat}. Given a string, we start in the starting state and
take the first character of the string, follow to the next state, then
take the second character and so on. If we end up in an accepting
state once the string is exhausted, then the string is accepted by the
automaton; otherwise it is not accepted. This also means that if along
the way we hit a case where the transition function $\delta$ is not
defined, we need to raise an error. In our implementation we will deal
with this case elegantly by using Scala's \texttt{Try}. Summing up: a
string $s$ is in the \emph{language accepted by the automaton} ${\cal
A}(\varSigma, Qs, Q_0, F, \delta)$ iff

\[
\widehat{\delta}(Q_0, s) \in F
\]

\noindent I leave it to you to think about a definition that describes
the set of all strings accepted by a deterministic finite automaton.
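
To see delta-hat at work, here is a small worked example. The input
string $aa$ is my own choice for illustration; the transitions are
read off the table for the example DFA given earlier. Unfolding the
two defining clauses step by step gives

\[
\begin{array}{lcl}
\widehat{\delta}(Q_0, aa) & = & \widehat{\delta}(\delta(Q_0, a), a)\\
                          & = & \widehat{\delta}(Q_1, a)\\
                          & = & \widehat{\delta}(\delta(Q_1, a), [])\\
                          & = & \widehat{\delta}(Q_4, [])\\
                          & = & Q_4
\end{array}
\]

\noindent Since $Q_4 \in F$, the string $aa$ is accepted. The same
unfolding for the string $ab$ ends in $Q_2$, which is not an accepting
state, so $ab$ is not accepted.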

\begin{figure}[p]
\small
\lstinputlisting[numbers=left,lastline=43]{../progs/automata/dfa.sc}
\caption{An implementation of DFAs in Scala using partial functions.
  Note some subtleties: \texttt{deltas} implements the delta-hat
  construction by lifting the (partial) transition function to lists
  of characters. Since \texttt{delta} is given as a partial function,
  it can obviously go ``wrong'', in which case the \texttt{Try} in
  \texttt{accepts} catches the error and returns \texttt{false}, which
  means the string is not accepted. The example \texttt{delta} in
  Lines 22--43 implements the DFA example shown earlier in the
  handout.\label{dfa}}
\end{figure}

My take on an implementation of DFAs in Scala is given in
Figure~\ref{dfa}. As you can see, there are many features of the
mathematical definition that are quite closely reflected in the
code. In the DFA-class, there is a starting state, called
\texttt{start}, with the polymorphic type \texttt{A}. There is a
partial function \texttt{delta} for specifying the transitions; these
partial functions take a state (of polymorphic type \texttt{A}) and an
input (of polymorphic type \texttt{C}) and produce a new state (of
type \texttt{A}). For the moment it is OK to assume that \texttt{A} is
some arbitrary type for states and the input is just characters. (The
reason for not having concrete types, but polymorphic types for the
states and the input of DFAs will become clearer later on.)

The DFA-class also has an argument for specifying final states. In the
implementation it is not a set of states, as in the mathematical
definition, but a function from states to booleans (this function is
supposed to return true whenever a state is final; false
otherwise). While this boolean function is different from a set of
states, Scala allows us to use sets for such functions (see Line 41 where
the DFA is initialised). Again it will become clear later on why I use
functions for final states, rather than sets.
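
\noindent The point about sets and functions can be tried out in
isolation. The following is only a small sketch and is not taken from
the figure: the names \texttt{fins} and \texttt{isFinal} and the use of
integers as states are assumptions made for illustration. It relies on
the fact that a Scala \texttt{Set[A]} can be used where a function from
\texttt{A} to \texttt{Boolean} is expected, since applying a set to an
element tests membership.

{\small\begin{lstlisting}[language=Scala]
// a set of (hypothetical) final states, here just integers
val fins: Set[Int] = Set(2, 4)

// a Set[Int] can be used where a function Int => Boolean
// is expected; applying it tests membership
val isFinal: Int => Boolean = fins

val r1 = isFinal(4)   // 4 is in the set
val r2 = isFinal(3)   // 3 is not in the set
\end{lstlisting}}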

The most important point in the implementation is that I use Scala's
partial functions for representing the transitions; alternatives would
have been \texttt{Maps} or even \texttt{Lists}. One of the main
advantages of using partial functions is that transitions can be quite
nicely defined by a series of \texttt{case} statements (see Lines
29--39 for an example). If you need to represent an automaton with a
sink state (catch-all-state), you can use Scala's pattern matching and
write something like

{\small\begin{lstlisting}[language=Scala]
abstract class State
...
case object Sink extends State

val delta : (State, Char) :=> State =
  { case (S0, 'a') => S1
    case (S1, 'a') => S2
    case _ => Sink
  }
\end{lstlisting}}

\noindent I leave it to you to work out what the corresponding DFA
looks like in the graph notation. Also, I suggest you tinker with the
Scala code in order to define a DFA that does not accept any string at all.
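
For experimenting, the fragment with the sink state can be completed
into a small self-contained script. This is only a sketch under
assumptions of my own: the states \texttt{S0}, \texttt{S1} and
\texttt{S2} are declared as case objects (the fragment elides how they
are declared), \texttt{:=>} is assumed to abbreviate Scala's
\texttt{PartialFunction}, and \texttt{deltas} is a hypothetical helper
that follows the delta-hat definition.

{\small\begin{lstlisting}[language=Scala]
// assumption: :=> abbreviates Scala's PartialFunction
type :=>[A, B] = PartialFunction[A, B]

// assumption: the states are declared as case objects
abstract class State
case object S0 extends State
case object S1 extends State
case object S2 extends State
case object Sink extends State

val delta : (State, Char) :=> State =
  { case (S0, 'a') => S1
    case (S1, 'a') => S2
    case _ => Sink
  }

// hypothetical helper: delta-hat, lifting delta to lists
def deltas(q: State, s: List[Char]) : State = s match {
  case Nil => q
  case c::cs => deltas(delta((q, c)), cs)
}

val r1 = deltas(S0, "aa".toList)   // ends in S2
val r2 = deltas(S0, "ab".toList)   // ends in Sink
\end{lstlisting}}

\noindent Because of the catch-all case, \texttt{delta} is defined for
every state and character in this sketch, so following the transitions
can never fail.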

Finally, I leave it to you to ponder whether this is a good
implementation of DFAs or not. In doing so I hope you notice that the
$\varSigma$ and $Qs$ components (the alphabet and the \emph{finite}
set of states, respectively) are missing from the class definition.
This means that the implementation allows you to do some ``fishy''
things you are not meant to do with DFAs. Which fishy things could
these be?



\subsection*{Non-Deterministic Finite Automata}

Remember we want to find out what the relation is between regular
expressions and automata. To do this with DFAs is a bit unwieldy.
While with DFAs it is always clear, given a state and a character,
what the next state is (potentially none), it will be convenient to
relax this restriction. That means we allow states to have several
potential successor states. We even allow more than one starting
state. The resulting construction is called a \emph{Non-Deterministic
Finite Automaton} (NFA), given also as a five-tuple ${\cal
A}(\varSigma, Qs, Q_{0s}, F, \rho)$ where

\begin{itemize}
\item $\varSigma$ is an alphabet,
\item $Qs$ is a finite set of states,
\item $Q_{0s}$ is a set of start states ($Q_{0s} \subseteq Qs$),
\item $F$ are some accepting states with $F \subseteq Qs$, and
\item $\rho$ is a transition relation.
\end{itemize}

\noindent
A typical example of an NFA is

% A NFA for (ab* + b)*a
\begin{center}
\begin{tikzpicture}[>=stealth',very thick, auto,
    every state/.style={minimum size=0pt,inner sep=3pt,
    draw=blue!50,very thick,fill=blue!20},scale=2]
\node[state,initial]  (Q_0)  {$Q_0$};
\node[state] (Q_1) [right=of Q_0] {$Q_1$};
\node[state, accepting] (Q_2) [right=of Q_1] {$Q_2$};
\path[->] (Q_0) edge [loop above] node  {$b$} ();
\path[<-] (Q_0) edge node [below]  {$b$} (Q_1);
\path[->] (Q_0) edge [bend left] node [above]  {$a$} (Q_1);
\path[->] (Q_0) edge [bend right] node [below]  {$a$} (Q_2);
\path[->] (Q_1) edge [loop above] node  {$a,b$} ();
\path[->] (Q_1) edge node  [above] {$a$} (Q_2);
\end{tikzpicture}
\end{center}

482 0f6e3c5a1751 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	242	\noindent
0f6e3c5a1751 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	243	This NFA happens to have only one starting state, but in general there
495 7d9d86dc7aa0 updated Christian Urban <urbanc@in.tum.de> parents: 492 diff changeset	244	could be more than one. Notice that in state $Q_0$ we might go to
7d9d86dc7aa0 updated Christian Urban <urbanc@in.tum.de> parents: 492 diff changeset	245	state $Q_1$ \emph{or} to state $Q_2$ when receiving an $a$. Similarly
7d9d86dc7aa0 updated Christian Urban <urbanc@in.tum.de> parents: 492 diff changeset	246	in state $Q_1$ and receiving an $a$, we can stay in state $Q_1$
7d9d86dc7aa0 updated Christian Urban <urbanc@in.tum.de> parents: 492 diff changeset	247	\emph{or} go to $Q_2$. This kind of choice is not allowed with
7d9d86dc7aa0 updated Christian Urban <urbanc@in.tum.de> parents: 492 diff changeset	248	DFAs. The downside of this choice in NFAs is that when it comes to
7d9d86dc7aa0 updated Christian Urban <urbanc@in.tum.de> parents: 492 diff changeset	249	deciding whether a string is accepted by a NFA we potentially have to
explore all possibilities. I leave it to you to work out which strings the
above NFA accepts.
482 0f6e3c5a1751 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	252
0f6e3c5a1751 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	253
485 bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	254	There are a number of additional points you should note about
483 6f508bcdaa30 updated Christian Urban <urbanc@in.tum.de> parents: 482 diff changeset	255	NFAs. Every DFA is a NFA, but not vice versa. The $\rho$ in NFAs is a
6f508bcdaa30 updated Christian Urban <urbanc@in.tum.de> parents: 482 diff changeset	256	transition \emph{relation} (DFAs have a transition function). The
difference between a function and a relation is that a function always
has a single output, while a relation gives, roughly speaking,
several outputs. Look again at the NFA above: if you are currently in
the state $Q_1$ and you read a character $b$, then you can either
transition to $Q_0$ \emph{or} stay in $Q_1$. Which route, or output, you take is
6f508bcdaa30 updated Christian Urban <urbanc@in.tum.de> parents: 482 diff changeset	262	not determined. This non-determinism can be represented by a
6f508bcdaa30 updated Christian Urban <urbanc@in.tum.de> parents: 482 diff changeset	263	relation.
482 0f6e3c5a1751 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	264
483 6f508bcdaa30 updated Christian Urban <urbanc@in.tum.de> parents: 482 diff changeset	265	My implementation of NFAs in Scala is shown in Figure~\ref{nfa}.
6f508bcdaa30 updated Christian Urban <urbanc@in.tum.de> parents: 482 diff changeset	266	Perhaps interestingly, I do not actually use relations for my NFAs,
485 bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	267	but use transition functions that return sets of states. DFAs have
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	268	partial transition functions that return a single state; my NFAs
return a set of states. I leave it to you to think about this representation for
485 bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	270	NFA-transitions and how it corresponds to the relations used in the
487 a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	271	mathematical definition of NFAs. An example of a transition function
488 598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	272	in Scala for the NFA shown above is
482 0f6e3c5a1751 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	273
490 4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	274	{\small\begin{lstlisting}[language=Scala]
487 a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	275	val nfa_delta : (State, Char) :=> Set[State] =
a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	276	{ case (Q0, 'a') => Set(Q1, Q2)
a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	277	case (Q0, 'b') => Set(Q0)
a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	278	case (Q1, 'a') => Set(Q1, Q2)
a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	279	case (Q1, 'b') => Set(Q0, Q1) }
a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	280	\end{lstlisting}}
a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	281
As in the mathematical definition, \texttt{starts} is a set of states
in NFAs; \texttt{fins} is again a function from states to
485 bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	284	booleans. The \texttt{next} function calculates the set of next states
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	285	reachable from a single state \texttt{q} by a character~\texttt{c}. In
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	286	case there is no such state---the partial transition function is
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	287	undefined---the empty set is returned (see function
753 d94fdbef1a4f updated Christian Urban <christian.urban@kcl.ac.uk> parents: 698 diff changeset	288	\texttt{applyOrElse} in Lines 11 and 12). The function \texttt{nexts}
485 bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	289	just lifts this function to sets of states.
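
To make the roles of \texttt{next} and \texttt{nexts} concrete, the
following is a minimal, self-contained sketch of such an NFA class,
applied to the example NFA from above. It is only an approximation of
the code in Figure~\ref{nfa}: the type abbreviation \texttt{:=>} is
assumed to stand for Scala's \texttt{PartialFunction}, the definitions
of \texttt{State}, \texttt{Q0}, \texttt{Q1} and \texttt{Q2} are made up
for this illustration, and the code is meant to be run as a Scala
script.

{\small\begin{lstlisting}[language=Scala]
// assumption: :=> abbreviates partial functions
type :=>[A, B] = PartialFunction[A, B]

// hypothetical state type for the example NFA
sealed trait State
case object Q0 extends State
case object Q1 extends State
case object Q2 extends State

case class NFA[A, C](starts: Set[A],
                     delta: (A, C) :=> Set[A],
                     fins: A => Boolean) {
  // next states of a single state; the empty set
  // is returned where delta is undefined
  def next(q: A, c: C) : Set[A] =
    delta.applyOrElse((q, c), (_: (A, C)) => Set[A]())

  // lifts next to sets of states
  def nexts(qs: Set[A], c: C) : Set[A] =
    qs.flatMap(next(_, c))

  // all states reachable from qs by the string s
  def deltas(qs: Set[A], s: List[C]) : Set[A] = s match {
    case Nil => qs
    case c::cs => deltas(nexts(qs, c), cs)
  }

  // breadth-first acceptance check
  def accepts(s: List[C]) : Boolean =
    deltas(starts, s).exists(fins)
}

val nfa_delta : (State, Char) :=> Set[State] =
  { case (Q0, 'a') => Set(Q1, Q2)
    case (Q0, 'b') => Set(Q0)
    case (Q1, 'a') => Set(Q1, Q2)
    case (Q1, 'b') => Set(Q0, Q1) }

val nfa = NFA[State, Char](Set[State](Q0), nfa_delta, _ == Q2)

println(nfa.next(Q0, 'a'))         // Set(Q1, Q2)
println(nfa.next(Q2, 'a'))         // Set(), delta is undefined
println(nfa.accepts("ba".toList))  // true
println(nfa.accepts("ab".toList))  // false
\end{lstlisting}}

\noindent
Note that \texttt{next} never throws an exception for a missing case:
an undefined transition simply contributes no successor states.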
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	290
484 e61ffb28994d updated Christian Urban <urbanc@in.tum.de> parents: 483 diff changeset	291	\begin{figure}[p]
482 0f6e3c5a1751 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	292	\small
753 d94fdbef1a4f updated Christian Urban <christian.urban@kcl.ac.uk> parents: 698 diff changeset	293	\lstinputlisting[numbers=left,lastline=43]{../progs/automata/nfa.sc}
485 bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	294	\caption{A Scala implementation of NFAs using partial functions.
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	295	Notice that the function \texttt{accepts} implements the
556 40e22ad45744 updated Christian Urban <urbanc@in.tum.de> parents: 518 diff changeset	296	acceptance of a string in a breadth-first search fashion. This can be a costly
485 bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	297	way of deciding whether a string is accepted or not in applications that need to handle
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	298	large NFAs and large inputs.\label{nfa}}
482 0f6e3c5a1751 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	299	\end{figure}
0f6e3c5a1751 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	300
Look very carefully at the \texttt{accepts} and \texttt{deltas}
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	302	functions in NFAs and remember that when accepting a string by a NFA
484 e61ffb28994d updated Christian Urban <urbanc@in.tum.de> parents: 483 diff changeset	303	we might have to explore all possible transitions (recall which state
753 d94fdbef1a4f updated Christian Urban <christian.urban@kcl.ac.uk> parents: 698 diff changeset	304	to go to is not unique any more with NFAs\ldots{}we need to explore
485 bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	305	potentially all next states). The implementation achieves this
487 a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	306	exploration through a \emph{breadth-first search}. This is fine for
485 bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	307	small NFAs, but can lead to real memory problems when the NFAs are
bigger and longer strings need to be processed. As a result, some
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	309	regular expression matching engines resort to a \emph{depth-first
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	310	search} with \emph{backtracking} in unsuccessful cases. In our
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	311	implementation we can implement a depth-first version of
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	312	\texttt{accepts} using Scala's \texttt{exists}-function as follows:
483 6f508bcdaa30 updated Christian Urban <urbanc@in.tum.de> parents: 482 diff changeset	313
6f508bcdaa30 updated Christian Urban <urbanc@in.tum.de> parents: 482 diff changeset	314
490 4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	315	{\small\begin{lstlisting}[language=Scala]
483 6f508bcdaa30 updated Christian Urban <urbanc@in.tum.de> parents: 482 diff changeset	316	def search(q: A, s: List[C]) : Boolean = s match {
6f508bcdaa30 updated Christian Urban <urbanc@in.tum.de> parents: 482 diff changeset	317	case Nil => fins(q)
485 bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	318	case c::cs => next(q, c).exists(search(_, cs))
483 6f508bcdaa30 updated Christian Urban <urbanc@in.tum.de> parents: 482 diff changeset	319	}
6f508bcdaa30 updated Christian Urban <urbanc@in.tum.de> parents: 482 diff changeset	320
485 bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	321	def accepts2(s: List[C]) : Boolean =
483 6f508bcdaa30 updated Christian Urban <urbanc@in.tum.de> parents: 482 diff changeset	322	starts.exists(search(_, s))
6f508bcdaa30 updated Christian Urban <urbanc@in.tum.de> parents: 482 diff changeset	323	\end{lstlisting}}
6f508bcdaa30 updated Christian Urban <urbanc@in.tum.de> parents: 482 diff changeset	324
6f508bcdaa30 updated Christian Urban <urbanc@in.tum.de> parents: 482 diff changeset	325	\noindent
487 a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	326	This depth-first way of exploration seems to work quite efficiently in
a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	327	many examples and is much less of a strain on memory. The problem is
a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	328	that the backtracking can get ``catastrophic'' in some
a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	329	examples---remember the catastrophic backtracking from earlier
a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	330	lectures. This depth-first search with backtracking is the reason for
a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	331	the abysmal performance of some regular expression matchings in Java,
Ruby and Python. I would like to show you this in the next two sections.
482 0f6e3c5a1751 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	333
268 18bef085a7ca updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 251 diff changeset	334
490 4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	335	\subsection*{Epsilon NFAs}
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	336
In order to get an idea of what calculations are performed by Java \&
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	338	friends, we need a method for transforming a regular expression into
753 d94fdbef1a4f updated Christian Urban <christian.urban@kcl.ac.uk> parents: 698 diff changeset	339	a corresponding automaton. This automaton should accept exactly those strings that
485 bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	340	are accepted by the regular expression. The simplest and most
753 d94fdbef1a4f updated Christian Urban <christian.urban@kcl.ac.uk> parents: 698 diff changeset	341	well-known method for this is called the \emph{Thompson Construction},
485 bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	342	after the Turing Award winner Ken Thompson. This method is by
487 a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	343	recursion over regular expressions and depends on the non-determinism
488 598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	344	in NFAs described in the previous section. You will see shortly why
487 a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	345	this construction works well with NFAs, but is not so straightforward
a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	346	with DFAs.
a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	347
a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	348	Unfortunately we are still one step away from our intended target
a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	349	though---because this construction uses a version of NFAs that allows
a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	350	``silent transitions''. The idea behind silent transitions is that
a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	351	they allow us to go from one state to the next without having to
consume a character. We label such silent transitions with the letter
a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	353	$\epsilon$ and call the automata $\epsilon$NFAs. Two typical examples
a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	354	of $\epsilon$NFAs are:
484 e61ffb28994d updated Christian Urban <urbanc@in.tum.de> parents: 483 diff changeset	355
e61ffb28994d updated Christian Urban <urbanc@in.tum.de> parents: 483 diff changeset	356
485 bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	357	\begin{center}
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	358	\begin{tabular}[t]{c@{\hspace{9mm}}c}
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	359	\begin{tikzpicture}[>=stealth',very thick,
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	360	every state/.style={minimum size=0pt,draw=blue!50,very thick,fill=blue!20},]
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	361	\node[state,initial] (Q_0) {$Q_0$};
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	362	\node[state] (Q_1) [above=of Q_0] {$Q_1$};
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	363	\node[state, accepting] (Q_2) [below=of Q_0] {$Q_2$};
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	364	\path[->] (Q_0) edge node [left] {$\epsilon$} (Q_1);
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	365	\path[->] (Q_0) edge node [left] {$\epsilon$} (Q_2);
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	366	\path[->] (Q_0) edge [loop right] node {$a$} ();
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	367	\path[->] (Q_1) edge [loop right] node {$a$} ();
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	368	\path[->] (Q_2) edge [loop right] node {$b$} ();
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	369	\end{tikzpicture} &
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	370
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	371	\raisebox{20mm}{
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	372	\begin{tikzpicture}[>=stealth',very thick,
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	373	every state/.style={minimum size=0pt,draw=blue!50,very thick,fill=blue!20},]
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	374	\node[state,initial] (r_1) {$R_1$};
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	375	\node[state] (r_2) [above=of r_1] {$R_2$};
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	376	\node[state, accepting] (r_3) [right=of r_1] {$R_3$};
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	377	\path[->] (r_1) edge node [below] {$b$} (r_3);
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	378	\path[->] (r_2) edge [bend left] node [above] {$a$} (r_3);
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	379	\path[->] (r_1) edge [bend left] node [left] {$\epsilon$} (r_2);
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	380	\path[->] (r_2) edge [bend left] node [right] {$a$} (r_1);
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	381	\end{tikzpicture}}
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	382	\end{tabular}
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	383	\end{center}
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	384
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	385	\noindent
487 a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	386	Consider the $\epsilon$NFA on the left-hand side: the
a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	387	$\epsilon$-transitions mean you do not have to ``consume'' any part of
a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	388	the input string, but ``silently'' change to a different state. In
a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	389	this example, if you are in the starting state $Q_0$, you can silently
a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	390	move either to $Q_1$ or $Q_2$. You can see that once you are in $Q_1$,
a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	391	respectively $Q_2$, you cannot ``go back'' to the other states. So it
490 4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	392	seems allowing $\epsilon$-transitions is a rather substantial
487 a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	393	extension to NFAs. On first appearances, $\epsilon$-transitions might
a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	394	even look rather strange, or even silly. To start with, silent
a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	395	transitions make the decision whether a string is accepted by an
automaton even harder: with $\epsilon$NFAs we have to check whether we
can first do some $\epsilon$-transitions and then do a
``proper'' transition; and after any ``proper'' transition we again
have to check whether we can do some more silent transitions. Even
a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	400	worse, if there is a silent transition pointing back to the same
a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	401	state, then we have to be careful our decision procedure for strings
a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	402	does not loop (remember the depth-first search for exploring all
a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	403	states).
485 bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	404
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	405	The obvious question is: Do we get anything in return for this hassle
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	406	with silent transitions? Well, we still have to work for it\ldots
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	407	unfortunately. If we were to follow the many textbooks on the
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	408	subject, we would now start with defining what $\epsilon$NFAs
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	409	are---that would require extending the transition relation of
490 4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	410	NFAs. Next, we would show that the $\epsilon$NFAs are equivalent to
488 598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	411	NFAs and so on. Once we have done all this on paper, we would need to
753 d94fdbef1a4f updated Christian Urban <christian.urban@kcl.ac.uk> parents: 698 diff changeset	412	implement $\epsilon$NFAs. Let's try to take a shortcut instead. We are
488 598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	413	not really interested in $\epsilon$NFAs; they are only a convenient
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	414	tool for translating regular expressions into automata. So we are not
going to implement them explicitly, but translate them immediately
into NFAs (in a sense $\epsilon$NFAs are just a convenient API for
lazy people ;o). How does this translation work? Well, we have to find
488 598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	418	all transitions of the form
485 bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	419
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	420	\[
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	421	q\stackrel{\epsilon}{\longrightarrow}\ldots\stackrel{\epsilon}{\longrightarrow}
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	422	\;\stackrel{a}{\longrightarrow}\;
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	423	\stackrel{\epsilon}{\longrightarrow}\ldots\stackrel{\epsilon}{\longrightarrow} q'
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	424	\]
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	425
753 d94fdbef1a4f updated Christian Urban <christian.urban@kcl.ac.uk> parents: 698 diff changeset	426	\noindent where somewhere in the ``middle'' is an $a$-transition (for
d94fdbef1a4f updated Christian Urban <christian.urban@kcl.ac.uk> parents: 698 diff changeset	427	a character, say, $a$). We replace them with
d94fdbef1a4f updated Christian Urban <christian.urban@kcl.ac.uk> parents: 698 diff changeset	428	$q \stackrel{a}{\longrightarrow} q'$. Doing this to the $\epsilon$NFA
d94fdbef1a4f updated Christian Urban <christian.urban@kcl.ac.uk> parents: 698 diff changeset	429	on the right-hand side above gives the NFA
485 bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	430
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	431	\begin{center}
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	432	\begin{tikzpicture}[>=stealth',very thick,
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	433	every state/.style={minimum size=0pt,draw=blue!50,very thick,fill=blue!20},]
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	434	\node[state,initial] (r_1) {$R_1$};
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	435	\node[state] (r_2) [above=of r_1] {$R_2$};
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	436	\node[state, accepting] (r_3) [right=of r_1] {$R_3$};
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	437	\path[->] (r_1) edge node [above] {$b$} (r_3);
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	438	\path[->] (r_2) edge [bend left] node [above] {$a$} (r_3);
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	439	\path[->] (r_1) edge [bend left] node [left] {$a$} (r_2);
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	440	\path[->] (r_2) edge [bend left] node [right] {$a$} (r_1);
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	441	\path[->] (r_1) edge [loop below] node {$a$} ();
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	442	\path[->] (r_1) edge [bend right] node [below] {$a$} (r_3);
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	443	\end{tikzpicture}
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	444	\end{center}
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	445
487 a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	446	\noindent where the single $\epsilon$-transition is replaced by
a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	447	three additional $a$-transitions. Please do the calculations yourself
a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	448	and verify that I did not forget any transition.
a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	449
a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	450	So in what follows, whenever we are given an $\epsilon$NFA we will
488 598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	451	replace it by an equivalent NFA. The Scala code for this translation
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	452	is given in Figure~\ref{enfa}. The main workhorse in this code is a
function that calculates a fixpoint of a function (Lines 6--12). This
488 598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	454	function is used for ``discovering'' which states are reachable by
487 a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	455	$\epsilon$-transitions. Once no new state is discovered, a fixpoint is
a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	456	reached. This is used for example when calculating the starting states
753 d94fdbef1a4f updated Christian Urban <christian.urban@kcl.ac.uk> parents: 698 diff changeset	457	of an equivalent NFA (see Line 28): we start with all starting states
487 a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	458	of the $\epsilon$NFA and then look for all additional states that can
a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	459	be reached by $\epsilon$-transitions. We keep on doing this until no
a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	460	new state can be reached. This is what the $\epsilon$-closure, named
in the code \texttt{ecl}, calculates. Similarly, a state of the NFA is
accepting when we can reach an accepting state of the $\epsilon$NFA
from it by $\epsilon$-transitions.
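
The fixpoint idea can be sketched in a few lines for the
$\epsilon$NFA on the right-hand side above. This is a simplified
illustration and not the code of Figure~\ref{enfa}: it is fixed to one
concrete automaton instead of being generic, the names
\texttt{RState}, \texttt{edelta}, \texttt{step}, \texttt{nfaNext},
\texttt{nfaStarts} and \texttt{nfaFins} are made up for this example,
and \texttt{:=>} is again assumed to abbreviate
\texttt{PartialFunction}. It is meant to be run as a Scala script.

{\small\begin{lstlisting}[language=Scala]
import scala.annotation.tailrec

type :=>[A, B] = PartialFunction[A, B]

// applies f repeatedly until the result no longer changes
@tailrec
def fixpT[A](f: A => A, x: A) : A = {
  val fx = f(x)
  if (fx == x) x else fixpT(f, fx)
}

// hypothetical state type for the right-hand epsilon-NFA
sealed trait RState
case object R1 extends RState
case object R2 extends RState
case object R3 extends RState

// None is an epsilon-transition, Some(c) a proper one
val edelta : (RState, Option[Char]) :=> Set[RState] =
  { case (R1, Some('b')) => Set(R3)
    case (R1, None)      => Set(R2)
    case (R2, Some('a')) => Set(R1, R3) }

// one step with a given label; empty set if undefined
def step(q: RState, l: Option[Char]) : Set[RState] =
  edelta.applyOrElse((q, l),
    (_: (RState, Option[Char])) => Set[RState]())

// one round of epsilon-steps, keeping the old states
def enexts(qs: Set[RState]) : Set[RState] =
  qs | qs.flatMap(step(_, None))

// epsilon-closure: repeat until no new state is found
def ecl(qs: Set[RState]) : Set[RState] =
  fixpT[Set[RState]](enexts, qs)

// proper transitions, lifted to sets of states
def nexts(qs: Set[RState], c: Char) : Set[RState] =
  qs.flatMap(step(_, Some(c)))

// components of the epsilon-free NFA
val nfaStarts : Set[RState] = ecl(Set(R1))
def nfaNext(q: RState, c: Char) : Set[RState] =
  ecl(nexts(ecl(Set(q)), c))
def nfaFins(q: RState) : Boolean =
  ecl(Set(q)).contains(R3)

println(nfaStarts)          // Set(R1, R2)
println(nfaNext(R1, 'b'))   // Set(R3)
\end{lstlisting}}

\noindent
For $R_1$ and the character $a$, \texttt{nfaNext} returns the three
states $R_1$, $R_2$ and $R_3$, which corresponds to the three
additional $a$-transitions in the picture. Evaluating
\texttt{nfaNext} on the remaining states is a convenient way of
checking the translation by machine.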
a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	464
485 bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	465
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	466	\begin{figure}[p]
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	467	\small
753 d94fdbef1a4f updated Christian Urban <christian.urban@kcl.ac.uk> parents: 698 diff changeset	468	\lstinputlisting[numbers=left,lastline=43]{../progs/automata/enfa.sc}
485 bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	469
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	470	\caption{A Scala function that translates $\epsilon$NFAs into NFAs. The
490 4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	471	transition function of an $\epsilon$NFA takes as input an \texttt{Option[C]}.
485 bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	472	\texttt{None} stands for an $\epsilon$-transition; \texttt{Some(c)}
488 598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	473	for a ``proper'' transition consuming a character. The functions in
753 d94fdbef1a4f updated Christian Urban <christian.urban@kcl.ac.uk> parents: 698 diff changeset	474	Lines 19--24 calculate
485 bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	475	all states reachable by one or more $\epsilon$-transitions for a given
753 d94fdbef1a4f updated Christian Urban <christian.urban@kcl.ac.uk> parents: 698 diff changeset	476	set of states. The NFA is constructed in Lines 30--34.
d94fdbef1a4f updated Christian Urban <christian.urban@kcl.ac.uk> parents: 698 diff changeset	477	Note the interesting commands in Lines 7 and 8: their purpose is
491 d5776c6018f0 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	478	to ensure that \texttt{fixpT} is the tail-recursive version of
d5776c6018f0 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	479	the fixpoint construction; otherwise we would quickly get a
d5776c6018f0 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	480	stack-overflow exception, even on small examples, due to limitations
d5776c6018f0 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	481	of the JVM.
d5776c6018f0 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	482	\label{enfa}}
485 bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	483	\end{figure}
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	484
487 a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	485	Also look carefully at how the transitions of $\epsilon$NFAs are
a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	486	implemented. The additional possibility of performing silent
a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	487	transitions is encoded by using \texttt{Option[C]} as the type for the
490 4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	488	``input''. The \texttt{Some}s stand for ``proper'' transitions where
487 a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	489	a character is consumed; \texttt{None} stands for
a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	490	$\epsilon$-transitions. The transition functions for the two
a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	491	$\epsilon$NFAs from the beginning of this section can be defined as
485 bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	492
490 4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	493	{\small\begin{lstlisting}[language=Scala]
487 a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	494	val enfa_trans1 : (State, Option[Char]) :=> Set[State] =
a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	495	{ case (Q0, Some('a')) => Set(Q0)
a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	496	case (Q0, None) => Set(Q1, Q2)
a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	497	case (Q1, Some('a')) => Set(Q1)
a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	498	case (Q2, Some('b')) => Set(Q2) }
a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	499
a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	500	val enfa_trans2 : (State, Option[Char]) :=> Set[State] =
a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	501	{ case (R1, Some('b')) => Set(R3)
a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	502	case (R1, None) => Set(R2)
a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	503	case (R2, Some('a')) => Set(R1, R3) }
a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	504	\end{lstlisting}}
a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	505
a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	506	\noindent
a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	507	I hope you agree now with my earlier statement that the $\epsilon$NFAs
a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	508	are just an API for NFAs.
a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	509
490 4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	510	\subsection*{Thompson Construction}
487 a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	511
a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	512	Having the translation of $\epsilon$NFAs to NFAs in place, we can
a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	513	finally return to the problem of translating regular expressions into
a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	514	equivalent NFAs. Recall that by equivalent we mean that the NFAs
485 bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	515	recognise the same language. Consider the simple regular expressions
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	516	$\ZERO$, $\ONE$ and $c$. They can be translated into equivalent NFAs
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	517	as follows:
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	518
488 598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	519	\begin{equation}\mbox{
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	520	\begin{tabular}[t]{l@{\hspace{10mm}}l}
444 3056a4c071b0 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 349 diff changeset	521	\raisebox{1mm}{$\ZERO$} &
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	522	\begin{tikzpicture}[scale=0.7,>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
482 0f6e3c5a1751 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	523	\node[state, initial] (Q_0) {$\mbox{}$};
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	524	\end{tikzpicture}\\\\
444 3056a4c071b0 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 349 diff changeset	525	\raisebox{1mm}{$\ONE$} &
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	526	\begin{tikzpicture}[scale=0.7,>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
482 0f6e3c5a1751 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	527	\node[state, initial, accepting] (Q_0) {$\mbox{}$};
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	528	\end{tikzpicture}\\\\
487 a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	529	\raisebox{3mm}{$c$} &
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	530	\begin{tikzpicture}[scale=0.7,>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
482 0f6e3c5a1751 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	531	\node[state, initial] (Q_0) {$\mbox{}$};
0f6e3c5a1751 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	532	\node[state, accepting] (Q_1) [right=of Q_0] {$\mbox{}$};
0f6e3c5a1751 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	533	\path[->] (Q_0) edge node [below] {$c$} (Q_1);
487 a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	534	\end{tikzpicture}\\
488 598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	535	\end{tabular}}\label{simplecases}
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	536	\end{equation}
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	537
487 a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	538	\noindent
a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	539	I leave it to you to check whether the NFAs match exactly those strings the
a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	540	regular expressions can match. To do this translation in code we need
495 7d9d86dc7aa0 updated Christian Urban <urbanc@in.tum.de> parents: 492 diff changeset	541	a way to construct states ``programmatically''\ldots and as an additional
7d9d86dc7aa0 updated Christian Urban <urbanc@in.tum.de> parents: 492 diff changeset	542	constraint Scala needs to recognise that these states are distinct.
487 a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	543	For this I implemented in Figure~\ref{thompson1} a class
a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	544	\texttt{TState} that includes a counter and a companion object that
488 598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	545	increases this counter whenever a new state is created.\footnote{You might
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	546	have to read up on what \emph{companion objects} do in Scala.}
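
\noindent
The counter idea can be sketched independently of the code in
Figure~\ref{thompson1}. The class name \texttt{FreshState} and the
field \texttt{id} below are hypothetical and chosen only for
illustration; the sketch merely shows how a companion object can hand
out states that are guaranteed to differ.

{\small\begin{lstlisting}[language=Scala]
// hypothetical names: not the TState code of the handout
class FreshState private (val id: Int) {
  override def toString: String = s"FreshState($id)"
}

object FreshState {
  // incremented on every call of apply, so no id is handed out twice
  private var counter = 0
  def apply(): FreshState = {
    counter += 1
    new FreshState(counter)
  }
}

val q1 = FreshState()
val q2 = FreshState()
\end{lstlisting}}

\noindent
Since the constructor is private, the only way to obtain a state is
via \texttt{FreshState()}, and therefore \texttt{q1.id} and
\texttt{q2.id} always differ.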
487 a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	547
485 bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	548	\begin{figure}[p]
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	549	\small
753 d94fdbef1a4f updated Christian Urban <christian.urban@kcl.ac.uk> parents: 698 diff changeset	550	\lstinputlisting[numbers=left,linerange={1-20}]{../progs/automata/thompson.sc}
d94fdbef1a4f updated Christian Urban <christian.urban@kcl.ac.uk> parents: 698 diff changeset	551	\hspace{5mm}\texttt{\dots}
d94fdbef1a4f updated Christian Urban <christian.urban@kcl.ac.uk> parents: 698 diff changeset	552	\lstinputlisting[numbers=left,linerange={28-45},firstnumber=28]{../progs/automata/thompson.sc}
d94fdbef1a4f updated Christian Urban <christian.urban@kcl.ac.uk> parents: 698 diff changeset	553	\caption{The first part of the Thompson Construction. Lines 10--19
488 598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	554	implement a way to create new states that are all
487 a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	555	distinct by virtue of a counter. This counter is
a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	556	increased in the companion object of \texttt{TState}
753 d94fdbef1a4f updated Christian Urban <christian.urban@kcl.ac.uk> parents: 698 diff changeset	557	whenever a new state is created. The code in Lines 38--45
488 598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	558	constructs NFAs for the simple regular expressions $\ZERO$, $\ONE$ and $c$.
495 7d9d86dc7aa0 updated Christian Urban <urbanc@in.tum.de> parents: 492 diff changeset	559	Compare this code with the pictures given in \eqref{simplecases} on
7d9d86dc7aa0 updated Christian Urban <urbanc@in.tum.de> parents: 492 diff changeset	560	Page~\pageref{simplecases}.
487 a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	561	\label{thompson1}}
485 bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	562	\end{figure}
bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	563
487 a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	564	\begin{figure}[p]
a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	565	\small
753 d94fdbef1a4f updated Christian Urban <christian.urban@kcl.ac.uk> parents: 698 diff changeset	566	\lstinputlisting[numbers=left,firstline=48,firstnumber=48,lastline=85]{../progs/automata/thompson.sc}
487 a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	567	\caption{The second part of the Thompson Construction implementing
490 4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	568	the composition of NFAs according to $\cdot$, $+$ and ${}^*$.
753 d94fdbef1a4f updated Christian Urban <christian.urban@kcl.ac.uk> parents: 698 diff changeset	569	The implicit class (Lines 48--54) about rich partial functions
487 a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	570	implements the infix operation \texttt{+++} which
753 d94fdbef1a4f updated Christian Urban <christian.urban@kcl.ac.uk> parents: 698 diff changeset	571	combines an $\epsilon$NFA transition with an NFA transition
495 7d9d86dc7aa0 updated Christian Urban <urbanc@in.tum.de> parents: 492 diff changeset	572	(both are given as partial functions---but with different types!).\label{thompson2}}
487 a697421eaa04 updated Christian Urban <urbanc@in.tum.de> parents: 485 diff changeset	573	\end{figure}
485 bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	574
488 598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	575	The case for the sequence regular expression $r_1 \cdot r_2$ is a bit more
489 e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	576	complicated: Say, we are given by recursion two NFAs representing the regular
488 598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	577	expressions $r_1$ and $r_2$ respectively.
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	578
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	579	\begin{center}
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	580	\begin{tikzpicture}[node distance=3mm,
488 598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	581	>=stealth',very thick,
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	582	every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
482 0f6e3c5a1751 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	583	\node[state, initial] (Q_0) {$\mbox{}$};
488 598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	584	\node[state, initial] (Q_01) [below=1mm of Q_0] {$\mbox{}$};
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	585	\node[state, initial] (Q_02) [above=1mm of Q_0] {$\mbox{}$};
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	586	\node (R_1) [right=of Q_0] {$\ldots$};
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	587	\node[state, accepting] (T_1) [right=of R_1] {$\mbox{}$};
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	588	\node[state, accepting] (T_2) [above=of T_1] {$\mbox{}$};
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	589	\node[state, accepting] (T_3) [below=of T_1] {$\mbox{}$};
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	590
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	591	\node (A_0) [right=2.5cm of T_1] {$\mbox{}$};
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	592	\node[state, initial] (A_01) [above=1mm of A_0] {$\mbox{}$};
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	593	\node[state, initial] (A_02) [below=1mm of A_0] {$\mbox{}$};
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	594
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	595	\node (b_1) [right=of A_0] {$\ldots$};
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	596	\node[state, accepting] (c_1) [right=of b_1] {$\mbox{}$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	597	\node[state, accepting] (c_2) [above=of c_1] {$\mbox{}$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	598	\node[state, accepting] (c_3) [below=of c_1] {$\mbox{}$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	599	\begin{pgfonlayer}{background}
488 598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	600	\node (1) [rounded corners, inner sep=1mm, thick,
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	601	draw=black!60, fill=black!20, fit= (Q_0) (R_1) (T_1) (T_2) (T_3)] {};
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	602	\node (2) [rounded corners, inner sep=1mm, thick,
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	603	draw=black!60, fill=black!20, fit= (A_0) (b_1) (c_1) (c_2) (c_3)] {};
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	604	\node [yshift=2mm] at (1.north) {$r_1$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	605	\node [yshift=2mm] at (2.north) {$r_2$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	606	\end{pgfonlayer}
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	607	\end{tikzpicture}
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	608	\end{center}
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	609
488 598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	610	\noindent The first NFA has some accepting states and the second some
489 e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	611	starting states. We obtain an $\epsilon$NFA for $r_1\cdot r_2$ by
e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	612	connecting the accepting states of the first NFA with
e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	613	$\epsilon$-transitions to the starting states of the second
e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	614	automaton. In doing so, the accepting states of the first NFA
e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	615	become non-accepting, like so:
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	616
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	617	\begin{center}
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	618	\begin{tikzpicture}[node distance=3mm,
488 598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	619	>=stealth',very thick,
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	620	every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
482 0f6e3c5a1751 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	621	\node[state, initial] (Q_0) {$\mbox{}$};
488 598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	622	\node[state, initial] (Q_01) [below=1mm of Q_0] {$\mbox{}$};
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	623	\node[state, initial] (Q_02) [above=1mm of Q_0] {$\mbox{}$};
482 0f6e3c5a1751 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	624	\node (r_1) [right=of Q_0] {$\ldots$};
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	625	\node[state] (t_1) [right=of r_1] {$\mbox{}$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	626	\node[state] (t_2) [above=of t_1] {$\mbox{}$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	627	\node[state] (t_3) [below=of t_1] {$\mbox{}$};
488 598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	628
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	629	\node (A_0) [right=2.5cm of t_1] {$\mbox{}$};
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	630	\node[state] (A_01) [above=1mm of A_0] {$\mbox{}$};
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	631	\node[state] (A_02) [below=1mm of A_0] {$\mbox{}$};
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	632
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	633	\node (b_1) [right=of A_0] {$\ldots$};
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	634	\node[state, accepting] (c_1) [right=of b_1] {$\mbox{}$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	635	\node[state, accepting] (c_2) [above=of c_1] {$\mbox{}$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	636	\node[state, accepting] (c_3) [below=of c_1] {$\mbox{}$};
488 598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	637	\path[->] (t_1) edge (A_01);
492 39b7ff2cf1bc updated Christian Urban <urbanc@in.tum.de> parents: 491 diff changeset	638	\path[->] (t_2) edge node [above] {$\epsilon$s} (A_01);
488 598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	639	\path[->] (t_3) edge (A_01);
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	640	\path[->] (t_1) edge (A_02);
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	641	\path[->] (t_2) edge (A_02);
492 39b7ff2cf1bc updated Christian Urban <urbanc@in.tum.de> parents: 491 diff changeset	642	\path[->] (t_3) edge node [below] {$\epsilon$s} (A_02);
488 598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	643
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	644	\begin{pgfonlayer}{background}
488 598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	645	\node (3) [rounded corners, inner sep=1mm, thick,
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	646	draw=black!60, fill=black!20, fit= (Q_0) (c_1) (c_2) (c_3)] {};
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	647	\node [yshift=2mm] at (3.north) {$r_1\cdot r_2$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	648	\end{pgfonlayer}
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	649	\end{tikzpicture}
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	650	\end{center}
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	651
489 e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	652	\noindent The idea behind this construction is that the start of any
e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	653	string is first recognised by the first NFA, then we silently change
e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	654	to the second NFA; the ending of the string is recognised by the
e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	655	second NFA\ldots just like the matching of a string by the regular expression
490 4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	656	$r_1\cdot r_2$. The Scala code for this construction is given in
753 d94fdbef1a4f updated Christian Urban <christian.urban@kcl.ac.uk> parents: 698 diff changeset	657	Figure~\ref{thompson2} in Lines 57--65. The starting states of the
489 e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	658	$\epsilon$NFA are the starting states of the first NFA (corresponding
e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	659	to $r_1$); the accepting function is the accepting function of the
e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	660	second NFA (corresponding to $r_2$). The new transition function is
e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	661	all the ``old'' transitions plus the $\epsilon$-transitions connecting
e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	662	the accepting states of the first NFA to the starting states of the
753 d94fdbef1a4f updated Christian Urban <christian.urban@kcl.ac.uk> parents: 698 diff changeset	663	second NFA (Lines 59 and 60). The $\epsilon$NFA is then immediately
489 e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	664	translated into an NFA.
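
\noindent
The sequence construction can also be tried out in a small,
self-contained sketch. It deliberately simplifies matters and is not
the code from Figure~\ref{thompson2}: states are plain \texttt{Int}s
assumed to be distinct, transitions are total functions returning the
empty set where nothing is defined, accepting states are given as a
set, and the resulting $\epsilon$NFA is simulated directly instead of
being translated into an NFA. All names (\texttt{ENFA}, \texttt{chr},
\texttt{seq}, \texttt{ecl}, \texttt{accepts}) are assumptions made
for illustration.

{\small\begin{lstlisting}[language=Scala]
// toy epsilon-NFA; None stands for an epsilon-transition
case class ENFA(starts: Set[Int],
                delta: (Int, Option[Char]) => Set[Int],
                fins: Set[Int])

// automaton for a single character c with the two given states
def chr(c: Char, from: Int, to: Int): ENFA =
  ENFA(Set(from),
       (q, d) => if (q == from && d == Some(c)) Set(to)
                 else Set.empty[Int],
       Set(to))

// sequence: old transitions plus epsilon-transitions from the
// accepting states of m1 to the starting states of m2
def seq(m1: ENFA, m2: ENFA): ENFA = {
  val delta = (q: Int, c: Option[Char]) => {
    val old = m1.delta(q, c) ++ m2.delta(q, c)
    if (c.isEmpty && m1.fins(q)) old ++ m2.starts else old
  }
  ENFA(m1.starts, delta, m2.fins)
}

// epsilon-closure of a set of states
def ecl(m: ENFA, qs: Set[Int]): Set[Int] = {
  val next = qs ++ qs.flatMap(q => m.delta(q, None))
  if (next == qs) qs else ecl(m, next)
}

// simulate m on the string s
def accepts(m: ENFA, s: String): Boolean = {
  val end = s.foldLeft(ecl(m, m.starts)) { (qs, c) =>
    ecl(m, qs.flatMap(q => m.delta(q, Some(c))))
  }
  end.exists(m.fins)
}

val ab = seq(chr('a', 0, 1), chr('b', 2, 3))
\end{lstlisting}}

\noindent
With these definitions \texttt{accepts(ab, "ab")} holds, while
\texttt{accepts(ab, "a")} does not: the accepting state of the first
automaton is no longer accepting in the combined one.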
e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	665
e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	666
490 4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	667	The case for the alternative regular expression $r_1 + r_2$ is
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	668	slightly different: We are given by recursion two NFAs representing
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	669	$r_1$ and $r_2$ respectively. Each NFA has some starting states and
753 d94fdbef1a4f updated Christian Urban <christian.urban@kcl.ac.uk> parents: 698 diff changeset	670	some accepting states. We obtain an NFA for the regular expression
d94fdbef1a4f updated Christian Urban <christian.urban@kcl.ac.uk> parents: 698 diff changeset	671	$r_1 + r_2$ by composing the transition functions (this crucially
d94fdbef1a4f updated Christian Urban <christian.urban@kcl.ac.uk> parents: 698 diff changeset	672	depends on knowing that the states of each component NFA are
d94fdbef1a4f updated Christian Urban <christian.urban@kcl.ac.uk> parents: 698 diff changeset	673	distinct---recall that we ensured this with some bespoke code
d94fdbef1a4f updated Christian Urban <christian.urban@kcl.ac.uk> parents: 698 diff changeset	674	for \texttt{TState}s). We also need to combine the starting states and
d94fdbef1a4f updated Christian Urban <christian.urban@kcl.ac.uk> parents: 698 diff changeset	675	accepting functions appropriately.
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	676
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	677	\begin{center}
490 4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	678	\begin{tabular}[t]{ccc}
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	679	\begin{tikzpicture}[node distance=3mm,
488 598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	680	>=stealth',very thick,
490 4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	681	every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	682	baseline=(current bounding box.center)]
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	683	\node at (0,0) (1) {$\mbox{}$};
489 e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	684	\node (2) [above=10mm of 1] {};
e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	685	\node[state, initial] (4) [above=1mm of 2] {$\mbox{}$};
e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	686	\node[state, initial] (5) [below=1mm of 2] {$\mbox{}$};
e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	687	\node[state, initial] (3) [below=10mm of 1] {$\mbox{}$};
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	688
489 e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	689	\node (a) [right=of 2] {$\ldots\,$};
e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	690	\node (a1) [right=of a] {$$};
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	691	\node[state, accepting] (a2) [above=of a1] {$\mbox{}$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	692	\node[state, accepting] (a3) [below=of a1] {$\mbox{}$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	693
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	694	\node (b) [right=of 3] {$\ldots$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	695	\node[state, accepting] (b1) [right=of b] {$\mbox{}$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	696	\node[state, accepting] (b2) [above=of b1] {$\mbox{}$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	697	\node[state, accepting] (b3) [below=of b1] {$\mbox{}$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	698	\begin{pgfonlayer}{background}
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	699	\node (1) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (2) (a1) (a2) (a3)] {};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	700	\node (2) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (3) (b1) (b2) (b3)] {};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	701	\node [yshift=3mm] at (1.north) {$r_1$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	702	\node [yshift=3mm] at (2.north) {$r_2$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	703	\end{pgfonlayer}
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	704	\end{tikzpicture}
490 4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	705	&
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	706	\mbox{}\qquad\tikz{\draw[>=stealth,line width=2mm,->] (0,0) -- (1, 0)}\quad\mbox{}
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	707	&
489 e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	708	\begin{tikzpicture}[node distance=3mm,
e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	709	>=stealth',very thick,
490 4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	710	every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	711	baseline=(current bounding box.center)]
489 e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	712	\node at (0,0) (1) {$\mbox{}$};
e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	713	\node (2) [above=10mm of 1] {$$};
e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	714	\node[state, initial] (4) [above=1mm of 2] {$\mbox{}$};
e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	715	\node[state, initial] (5) [below=1mm of 2] {$\mbox{}$};
e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	716	\node[state, initial] (3) [below=10mm of 1] {$\mbox{}$};
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	717
489 e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	718	\node (a) [right=of 2] {$\ldots\,$};
e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	719	\node (a1) [right=of a] {$$};
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	720	\node[state, accepting] (a2) [above=of a1] {$\mbox{}$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	721	\node[state, accepting] (a3) [below=of a1] {$\mbox{}$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	722
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	723	\node (b) [right=of 3] {$\ldots$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	724	\node[state, accepting] (b1) [right=of b] {$\mbox{}$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	725	\node[state, accepting] (b2) [above=of b1] {$\mbox{}$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	726	\node[state, accepting] (b3) [below=of b1] {$\mbox{}$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	727
489 e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	728	%\path[->] (1) edge node [above] {$\epsilon$} (2);
e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	729	%\path[->] (1) edge node [below] {$\epsilon$} (3);
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	730
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	731	\begin{pgfonlayer}{background}
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	732	\node (3) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (1) (a2) (a3) (b2) (b3)] {};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	733	\node [yshift=3mm] at (3.north) {$r_1+ r_2$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	734	\end{pgfonlayer}
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	735	\end{tikzpicture}
490 4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	736	\end{tabular}
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	737	\end{center}
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	738
489 e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	739	\noindent The code for this construction is in Figure~\ref{thompson2}
753 d94fdbef1a4f updated Christian Urban <christian.urban@kcl.ac.uk> parents: 698 diff changeset	740	in Lines 67--75.
490 4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	741
Finally, for the $*$-case we have a NFA for $r$. We add a new starting
state, connect it via $\epsilon$-transitions to the old starting
states, and connect the accepting states of $r$ back to this new
starting state, again via $\epsilon$-transitions. The new starting
state is also an accepting state, because $r^*$ can recognise the
empty string.
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	746
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	747	\begin{center}
495 7d9d86dc7aa0 updated Christian Urban <urbanc@in.tum.de> parents: 492 diff changeset	748	\begin{tabular}[b]{@{}ccc@{}}
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	749	\begin{tikzpicture}[node distance=3mm,
490 4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	750	>=stealth',very thick,
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	751	every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	752	baseline=(current bounding box.north)]
495 7d9d86dc7aa0 updated Christian Urban <urbanc@in.tum.de> parents: 492 diff changeset	753	\node (2) {$\mbox{}$};
7d9d86dc7aa0 updated Christian Urban <urbanc@in.tum.de> parents: 492 diff changeset	754	\node[state, initial] (4) [above=1mm of 2] {$\mbox{}$};
7d9d86dc7aa0 updated Christian Urban <urbanc@in.tum.de> parents: 492 diff changeset	755	\node[state, initial] (5) [below=1mm of 2] {$\mbox{}$};
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	756	\node (a) [right=of 2] {$\ldots$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	757	\node[state, accepting] (a1) [right=of a] {$\mbox{}$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	758	\node[state, accepting] (a2) [above=of a1] {$\mbox{}$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	759	\node[state, accepting] (a3) [below=of a1] {$\mbox{}$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	760	\begin{pgfonlayer}{background}
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	761	\node (1) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (2) (a1) (a2) (a3)] {};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	762	\node [yshift=3mm] at (1.north) {$r$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	763	\end{pgfonlayer}
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	764	\end{tikzpicture}
490 4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	765	&
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	766	\raisebox{-16mm}{\;\tikz{\draw[>=stealth,line width=2mm,->] (0,0) -- (1, 0)}}
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	767	&
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	768	\begin{tikzpicture}[node distance=3mm,
489 e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	769	>=stealth',very thick,
490 4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	770	every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	771	baseline=(current bounding box.north)]
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	772	\node at (0,0) [state, initial,accepting] (1) {$\mbox{}$};
495 7d9d86dc7aa0 updated Christian Urban <urbanc@in.tum.de> parents: 492 diff changeset	773	\node (2) [right=16mm of 1] {$\mbox{}$};
7d9d86dc7aa0 updated Christian Urban <urbanc@in.tum.de> parents: 492 diff changeset	774	\node[state] (4) [above=1mm of 2] {$\mbox{}$};
7d9d86dc7aa0 updated Christian Urban <urbanc@in.tum.de> parents: 492 diff changeset	775	\node[state] (5) [below=1mm of 2] {$\mbox{}$};
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	776	\node (a) [right=of 2] {$\ldots$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	777	\node[state] (a1) [right=of a] {$\mbox{}$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	778	\node[state] (a2) [above=of a1] {$\mbox{}$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	779	\node[state] (a3) [below=of a1] {$\mbox{}$};
495 7d9d86dc7aa0 updated Christian Urban <urbanc@in.tum.de> parents: 492 diff changeset	780	\path[->] (1) edge node [below] {$\epsilon$} (4);
7d9d86dc7aa0 updated Christian Urban <urbanc@in.tum.de> parents: 492 diff changeset	781	\path[->] (1) edge node [below] {$\epsilon$} (5);
7d9d86dc7aa0 updated Christian Urban <urbanc@in.tum.de> parents: 492 diff changeset	782	\path[->] (a1) edge [bend left=45] node [below] {$\epsilon$} (1);
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	783	\path[->] (a2) edge [bend right] node [below] {$\epsilon$} (1);
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	784	\path[->] (a3) edge [bend left=45] node [below] {$\epsilon$} (1);
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	785	\begin{pgfonlayer}{background}
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	786	\node (2) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (1) (a2) (a3)] {};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	787	\node [yshift=3mm] at (2.north) {$r^*$};
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	788	\end{pgfonlayer}
490 4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	789	\end{tikzpicture}
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	790	\end{tabular}
143 e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	791	\end{center}
e3fd4c5995ef added Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 142 diff changeset	792
490 4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	793	\noindent
The corresponding code is in Figure~\ref{thompson2} in Lines 77--85.
489 e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	795
To sum up, you can see in the sequence and star cases the need for
silent $\epsilon$-transitions; without them the construction just
becomes awkward. Similarly, the alternative case shows the need for
the nondeterminism of NFAs: it is not obvious how to form the
`alternative' composition of two DFAs, because DFAs do not allow
several starting and successor states. All these constructions can now
be put together in the following recursive function:
489 e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	803
e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	804
490 4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	805	{\small\begin{lstlisting}[language=Scala]
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	806	def thompson(r: Rexp) : NFAt = r match {
488 598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	807	case ZERO => NFA_ZERO()
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	808	case ONE => NFA_ONE()
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	809	case CHAR(c) => NFA_CHAR(c)
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	810	case ALT(r1, r2) => NFA_ALT(thompson(r1), thompson(r2))
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	811	case SEQ(r1, r2) => NFA_SEQ(thompson(r1), thompson(r2))
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	812	case STAR(r1) => NFA_STAR(thompson(r1))
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	813	}
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	814	\end{lstlisting}}
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	815
489 e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	816	\noindent
It calculates a NFA from a regular expression. At last we can run
NFAs for our evil regular expression examples. The graph on the
left shows that when translating a regular expression such as
$a^{?\{n\}} \cdot a^{\{n\}}$ into a NFA, the size can blow up and then
even the relatively fast (on small examples) breadth-first search can
be slow\ldots the red line maxes out at an $n$ of about 15.
d94fdbef1a4f updated Christian Urban <christian.urban@kcl.ac.uk> parents: 698 diff changeset	823
d94fdbef1a4f updated Christian Urban <christian.urban@kcl.ac.uk> parents: 698 diff changeset	824
The graph on the right shows that with `evil' regular expressions
the depth-first search can also be abysmally slow. Even if the graphs
do not completely overlap with the curves of Python, Ruby and Java,
they are similar enough.
489 e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	829
488 598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	830
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	831	\begin{center}
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	832	\begin{tabular}{@{\hspace{-1mm}}c@{\hspace{1mm}}c@{}}
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	833	\begin{tikzpicture}
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	834	\begin{axis}[
490 4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	835	title={Graph: $a^{?\{n\}} \cdot a^{\{n\}}$ and strings
489 e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	836	$\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
490 4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	837	title style={yshift=-2ex},
489 e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	838	xlabel={$n$},
e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	839	x label style={at={(1.05,0.0)}},
e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	840	ylabel={time in secs},
e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	841	enlargelimits=false,
e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	842	xtick={0,5,...,30},
e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	843	xmax=33,
e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	844	ymax=35,
e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	845	ytick={0,5,...,30},
e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	846	scaled ticks=false,
e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	847	axis lines=left,
e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	848	width=5.5cm,
490 4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	849	height=4cm,
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	850	legend entries={Python,Ruby, breadth-first NFA},
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	851	legend style={at={(0.5,-0.25)},anchor=north,font=\small},
489 e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	852	legend cell align=left]
e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	853	\addplot[blue,mark=*, mark options={fill=white}] table {re-python.data};
e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	854	\addplot[brown,mark=triangle*, mark options={fill=white}] table {re-ruby.data};
% breadth-first search in NFAs
e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	856	\addplot[red,mark=*, mark options={fill=white}] table {
e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	857	1 0.00586
e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	858	2 0.01209
e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	859	3 0.03076
e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	860	4 0.08269
e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	861	5 0.12881
e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	862	6 0.25146
e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	863	7 0.51377
e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	864	8 0.89079
e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	865	9 1.62802
e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	866	10 3.05326
e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	867	11 5.92437
e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	868	12 11.67863
e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	869	13 24.00568
e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	870	};
e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	871	\end{axis}
e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	872	\end{tikzpicture}
e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	873	&
e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	874	\begin{tikzpicture}
e28d7a327870 updated Christian Urban <urbanc@in.tum.de> parents: 488 diff changeset	875	\begin{axis}[
title={Graph: $(a^*)^* \cdot b$ and strings
488 598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	877	$\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
490 4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	878	title style={yshift=-2ex},
488 598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	879	xlabel={$n$},
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	880	x label style={at={(1.05,0.0)}},
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	881	ylabel={time in secs},
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	882	enlargelimits=false,
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	883	xtick={0,5,...,30},
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	884	xmax=33,
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	885	ymax=35,
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	886	ytick={0,5,...,30},
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	887	scaled ticks=false,
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	888	axis lines=left,
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	889	width=5.5cm,
490 4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	890	height=4cm,
753 d94fdbef1a4f updated Christian Urban <christian.urban@kcl.ac.uk> parents: 698 diff changeset	891	legend entries={Python, Java 8, depth-first NFA},
490 4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	892	legend style={at={(0.5,-0.25)},anchor=north,font=\small},
488 598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	893	legend cell align=left]
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	894	\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	895	\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	896	% depth-first search in NFAs
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	897	\addplot[red,mark=*, mark options={fill=white}] table {
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	898	1 0.00605
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	899	2 0.03086
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	900	3 0.11994
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	901	4 0.45389
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	902	5 2.06192
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	903	6 8.04894
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	904	7 32.63549
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	905	};
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	906	\end{axis}
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	907	\end{tikzpicture}
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	908	\end{tabular}
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	909	\end{center}
598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	910
753 d94fdbef1a4f updated Christian Urban <christian.urban@kcl.ac.uk> parents: 698 diff changeset	911	\noindent
d94fdbef1a4f updated Christian Urban <christian.urban@kcl.ac.uk> parents: 698 diff changeset	912	OK\ldots now you know why regular expression matchers in those
d94fdbef1a4f updated Christian Urban <christian.urban@kcl.ac.uk> parents: 698 diff changeset	913	languages are sometimes so slow. A bit surprising, don't you agree?
268 18bef085a7ca updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 251 diff changeset	914
490 4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	915	\subsection*{Subset Construction}
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	916
Of course, some developers of regular expression matchers are aware of
these problems with sluggish NFAs and try to address them. In this
section I would like to show you one common technique for alleviating
the problem. This will also explain why we insisted on polymorphic
types in our DFA code (remember I used \texttt{A} and \texttt{C} for
the types of states and the input, see Figure~\ref{dfa} on
Page~\pageref{dfa}).\bigskip
268 18bef085a7ca updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 251 diff changeset	924
490 4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	925	\noindent
To start, remember that we did not bother with defining and implementing
$\epsilon$NFAs: we immediately translated them into equivalent NFAs,
equivalent in the sense of accepting the same language (though we only
claimed this and did not prove it rigorously). Remember also that NFAs
have non-deterministic transitions defined as a relation; alternatively,
as in my Scala implementation, they are given by transition functions
returning sets of states. This non-determinism is crucial for the
Thompson Construction to work (recall the cases for $\cdot$, $+$ and
${}^*$). But this non-determinism makes it harder with NFAs to decide
whether a string is accepted or not, whereas such a decision is rather
straightforward with DFAs: recall their transition function is a
``real'' function that returns a single state. So with DFAs we do not
have to search at all. What is perhaps interesting is the fact that for
every NFA we can find a DFA that recognises the same language. This
might sound a bit paradoxical: NFA $\rightarrow$ decision of acceptance
hard; DFA $\rightarrow$ decision easy. But this \emph{is} true\ldots
though of course there is always a caveat: nothing is ever free in
life. Let's see what this actually means.
488 598741d39d21 updated Christian Urban <urbanc@in.tum.de> parents: 487 diff changeset	944
There are actually a number of methods for transforming an NFA into
an equivalent DFA, but the most famous one is the \emph{subset
construction}. Consider the following NFA where the states are
labelled with $0$, $1$ and $2$.

\begin{center}
\begin{tabular}{c@{\hspace{10mm}}c}
\begin{tikzpicture}[scale=0.7,>=stealth',very thick,
                    every state/.style={minimum size=0pt,
                    draw=blue!50,very thick,fill=blue!20},
                    baseline=(current bounding box.center)]
\node[state,initial]  (Q_0)  {$0$};
\node[state]          (Q_1) [below=of Q_0] {$1$};
\node[state, accepting] (Q_2) [below=of Q_1] {$2$};

\path[->] (Q_0) edge node [right] {$b$} (Q_1);
\path[->] (Q_1) edge node [right] {$a,b$} (Q_2);
\path[->] (Q_0) edge [loop above] node {$a, b$} ();
\end{tikzpicture}
&
\begin{tabular}{r|ll}
states & $a$ & $b$\\
\hline
$\{\}\phantom{\star}$ & $\{\}$ & $\{\}$\\
start: $\{0\}\phantom{\star}$ & $\{0\}$ & $\{0,1\}$\\
$\{1\}\phantom{\star}$ & $\{2\}$ & $\{2\}$\\
$\{2\}\star$ & $\{\}$ & $\{\}$\\
$\{0,1\}\phantom{\star}$ & $\{0,2\}$ & $\{0,1,2\}$\\
$\{0,2\}\star$ & $\{0\}$ & $\{0,1\}$\\
$\{1,2\}\star$ & $\{2\}$ & $\{2\}$\\
$\{0,1,2\}\star$ & $\{0,2\}$ & $\{0,1,2\}$\\
\end{tabular}
\end{tabular}
\end{center}

\noindent The states of the corresponding DFA are given by generating
all subsets of the set $\{0,1,2\}$ (seen in the states column
in the table on the right). The other columns define the transition
function for the DFA for inputs $a$ and $b$. The first row states that
$\{\}$ is the sink state which has transitions for $a$ and $b$ to
itself. The next three rows are calculated as follows:

\begin{itemize}
\item Suppose you calculate the entry for the $a$-transition for state
  $\{0\}$. Look for all states in the NFA that can be reached by such
  a transition from this state; this is only state $0$; therefore from
  state $\{0\}$ we can go to state $\{0\}$ via an $a$-transition.
\item Do the same for the $b$-transition; you can reach states $0$ and
  $1$ in the NFA; therefore in the DFA we can go from state $\{0\}$ to
  state $\{0,1\}$ via a $b$-transition.
\item Continue with the states $\{1\}$ and $\{2\}$.
\end{itemize}

\noindent
Once you have filled in the transitions for the `simple' states
$\{0\}$, \ldots, $\{2\}$, you only have to build the union for the
compound states $\{0,1\}$, $\{0,2\}$ and so on. For example for
$\{0,1\}$ you take the union of the rows for $\{0\}$ and $\{1\}$,
which gives $\{0,2\}$ for $a$, and $\{0,1,2\}$ for $b$. And so on.

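\noindent
The row-by-row calculation can also be written down in a few lines of
Scala. The following is only a minimal sketch for the example NFA, not
the implementation used in this handout; the names \texttt{nfaDelta},
\texttt{step} and \texttt{rows} are made up for illustration.

{\small\begin{lstlisting}[language=Scala]
// NFA from the example: states 0, 1, 2 over the alphabet {a, b}
// (nfaDelta, step and rows are hypothetical names for this sketch)
val nfaDelta: Map[(Int, Char), Set[Int]] = Map(
  (0, 'a') -> Set(0),
  (0, 'b') -> Set(0, 1),
  (1, 'a') -> Set(2),
  (1, 'b') -> Set(2))    // state 2 has no outgoing transitions

// one table entry: the union of the entries of the 'simple' states
def step(qs: Set[Int], c: Char): Set[Int] =
  qs.flatMap(q => nfaDelta.getOrElse((q, c), Set[Int]()))

// all 8 subsets of {0, 1, 2}, each paired with its a- and b-entry
val rows: Map[Set[Int], (Set[Int], Set[Int])] =
  Set(0, 1, 2).subsets()
    .map(qs => (qs, (step(qs, 'a'), step(qs, 'b'))))
    .toMap
\end{lstlisting}}

\noindent
For instance \texttt{rows(Set(0, 1))} evaluates to the pair consisting
of $\{0,2\}$ and $\{0,1,2\}$, which is the entry calculated by hand
above.
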
The starting state of the DFA can be calculated from the starting
states of the NFA, that is in this case $\{0\}$. But in general there
can of course be many starting states in the NFA and you would take
the corresponding subset as \emph{the} starting state of the DFA.

The accepting states in the DFA are given by all sets that contain a
$2$, which is the only accepting state in this NFA. But again in
general if the subset contains any accepting state from the NFA, then
the corresponding state in the DFA is accepting as well. This
completes the subset construction. The corresponding DFA for the NFA
shown above is:

\begin{equation}
\begin{tikzpicture}[scale=0.8,>=stealth',very thick,
                    every state/.style={minimum size=0pt,
                    draw=blue!50,very thick,fill=blue!20},
                    baseline=(current bounding box.center)]
\node[state,initial]  (q0)  {$0$};
\node[state]          (q01) [right=of q0] {$0,1$};
\node[state,accepting] (q02) [below=of q01] {$0,2$};
\node[state,accepting] (q012) [right=of q02] {$0,1,2$};
\node[state]          (q1) [below=0.5cm of q0] {$1$};
\node[state,accepting] (q2) [below=1cm of q1] {$2$};
\node[state]          (qn) [below left=1cm of q2] {$\{\}$};
\node[state,accepting] (q12) [below right=1cm of q2] {$1,2$};

\path[->] (q0) edge node [above] {$b$} (q01);
\path[->] (q01) edge node [above] {$b$} (q012);
\path[->] (q0) edge [loop above] node {$a$} ();
\path[->] (q012) edge [loop right] node {$b$} ();
\path[->] (q012) edge node [below] {$a$} (q02);
\path[->] (q02) edge node [below] {$a$} (q0);
\path[->] (q01) edge [bend left] node [left] {$a$} (q02);
\path[->] (q02) edge [bend left] node [right] {$b$} (q01);
\path[->] (q1) edge node [left] {$a,b$} (q2);
\path[->] (q12) edge node [right] {$a, b$} (q2);
\path[->] (q2) edge node [right] {$a, b$} (qn);
\path[->] (qn) edge [loop left] node {$a,b$} ();
\end{tikzpicture}\label{subsetdfa}
\end{equation}

\noindent
Please check that this is indeed a DFA. The big question is whether
this DFA recognises the same language as the NFA we started with.
I let you ponder this question.

There are also two points to note: One is that very often in the
subset construction the resulting DFA contains a number of ``dead''
states that are never reachable from the starting state. This is
obvious in the example, where the states $\{1\}$, $\{2\}$, $\{1,2\}$
and $\{\}$ can never be reached from the starting state. But this
might not always be as obvious as that. In effect the DFA in this
example is not a \emph{minimal} DFA (more about this in a minute).
Such dead states can be safely removed without changing the language
that is recognised by the DFA. The other point is that in some cases,
however, the subset construction produces a DFA that does \emph{not}
contain any dead states and that is already minimal. This in turn
means that in some cases the number of states can increase
exponentially when going from NFAs to DFAs, namely up to $2^n$ (which
is the number of subsets you can form for a set of $n$ states). This
blow-up in the number of states in the DFA is again bad news for how
quickly you can decide whether a string is accepted by a DFA or
not. So the caveat with DFAs is that they might make the task of
finding the next state trivial, but might require up to $2^n$ states
where the NFA has only $n$.\bigskip

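\noindent
One way of avoiding the dead states altogether is to generate only
those subsets that are reachable from the starting state. A minimal
sketch of this idea for the example NFA could look as follows; it is
not the implementation used in this handout, and the names
\texttt{nfaDelta}, \texttt{step} and \texttt{reachable} are made up
for illustration.

{\small\begin{lstlisting}[language=Scala]
// NFA from the example (hypothetical names for this sketch)
val nfaDelta: Map[(Int, Char), Set[Int]] = Map(
  (0, 'a') -> Set(0), (0, 'b') -> Set(0, 1),
  (1, 'a') -> Set(2), (1, 'b') -> Set(2))

def step(qs: Set[Int], c: Char): Set[Int] =
  qs.flatMap(q => nfaDelta.getOrElse((q, c), Set[Int]()))

// collects all DFA states that can be reached from the states
// in todo; seen contains the DFA states found so far
def reachable(todo: List[Set[Int]],
              seen: Set[Set[Int]]): Set[Set[Int]] =
  todo match {
    case Nil => seen
    case qs :: rest =>
      val succs = List('a', 'b').map(step(qs, _))
                    .distinct.filterNot(seen.contains)
      reachable(rest ++ succs, seen ++ succs)
  }

val start = Set(0)
val live = reachable(List(start), Set(start))
\end{lstlisting}}

\noindent
For the example this leaves only the four states $\{0\}$, $\{0,1\}$,
$\{0,2\}$ and $\{0,1,2\}$ in \texttt{live}.
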
\noindent
To conclude this section, how conveniently can we
implement the subset construction with our versions of NFAs and
DFAs? Very conveniently. The code is just:

{\small\begin{lstlisting}[language=Scala]
def subset[A, C](nfa: NFA[A, C]) : DFA[Set[A], C] = {
  DFA(nfa.starts,                            // starting state
      { case (qs, c) => nfa.nexts(qs, c) },  // transition function
      _.exists(nfa.fins))                    // accepting states
}
\end{lstlisting}}

\noindent
The interesting point in this code is that the state type of the
calculated DFA is \texttt{Set[A]}. Think carefully about why this
works out correctly.

The DFA is then given by three components: the starting state, the
transition function and the accepting-states function. The starting
states are a set in the given NFA, but a single state in the DFA. The
transition function, given the state \texttt{qs} and input \texttt{c},
needs to produce the next state: this is the set of all NFA states
that are reachable via \texttt{c} from some state in \texttt{qs}. The
function \texttt{nexts} from the NFA class already calculates this for
us. The accepting-states function for the DFA is true whenever at
least one state in the subset is accepting (that is true) in the
NFA.\medskip

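\noindent
To experiment with \texttt{subset} in isolation, the following
self-contained sketch can be used. The classes \texttt{NFA} and
\texttt{DFA} in it are simplified stand-ins, assumed here only for
illustration; they are not necessarily identical to the classes
defined earlier in this handout. The NFA is the one from the example
above.

{\small\begin{lstlisting}[language=Scala]
// simplified stand-ins for the DFA and NFA classes
// (assumptions of this sketch)
case class DFA[A, C](start: A,
                     delta: PartialFunction[(A, C), A],
                     fins: A => Boolean) {
  def accepts(s: List[C]): Boolean =
    fins(s.foldLeft(start)((q, c) => delta((q, c))))
}

case class NFA[A, C](starts: Set[A],
                     delta: PartialFunction[(A, C), Set[A]],
                     fins: A => Boolean) {
  // missing transitions lead to the empty set of states
  def next(q: A, c: C): Set[A] =
    if (delta.isDefinedAt((q, c))) delta((q, c)) else Set[A]()
  def nexts(qs: Set[A], c: C): Set[A] = qs.flatMap(next(_, c))
}

def subset[A, C](nfa: NFA[A, C]) : DFA[Set[A], C] = {
  DFA(nfa.starts,
      { case (qs, c) => nfa.nexts(qs, c) },
      _.exists(nfa.fins))
}

// the example NFA with states 0, 1 and 2
val nfa = NFA[Int, Char](Set(0),
  { case (0, 'a') => Set(0)
    case (0, 'b') => Set(0, 1)
    case (1, _)   => Set(2) },
  _ == 2)

val dfa = subset(nfa)
\end{lstlisting}}

\noindent
With these definitions \texttt{dfa.accepts("ba".toList)} is true,
while \texttt{dfa.accepts("ab".toList)} is false.
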
\noindent
You might want to spend some quality time tinkering with this code
and pondering it. Then you will probably notice that it is
actually a bit silly. The whole point of translating the NFA into a
DFA via the subset construction is to make the decision of whether a
string is accepted or not faster. Given the code above, the generated
DFA will be exactly as fast, or as slow, as the NFA we started with
(actually it will even be a tiny bit slower). The reason is that we
just re-use the \texttt{nexts} function from the NFA. This function
implements the non-deterministic breadth-first search. You might be
thinking: This is cheating! \ldots{} Well, not quite as you will see
later, but in terms of speed we still need to work a bit in order to
get sometimes(!) a faster DFA. Let's do this next.

\subsection*{DFA Minimisation}

As seen in \eqref{subsetdfa}, the subset construction from an NFA to a
DFA can result in a rather ``inefficient'' DFA, meaning there are
states that are not needed. There are two kinds of such unneeded
states: \emph{unreachable} states and \emph{non-distinguishable}
states. The first kind of states can just be removed without affecting
the language that can be recognised (after all they are
unreachable). The second kind can also be identified and merged, and
thus a DFA can be \emph{minimised} by the following algorithm:

\begin{enumerate}
\item Take all pairs $(q, p)$ with $q \not= p$.
\item Mark all pairs that consist of an accepting and a non-accepting
      state.
\item For all unmarked pairs $(q, p)$ and all characters $c$
      test whether

      \begin{center}
      $(\delta(q, c), \delta(p,c))$
      \end{center}

      is marked. If this is the case for some $c$, then also mark
      $(q, p)$.
\item Repeat the last step until nothing changes.
\item All unmarked pairs can be merged.
\end{enumerate}

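\noindent
The marking procedure can be sketched in Scala as follows. This is
only a minimal illustration under some assumptions: states are
represented by the integers $0$ to $4$ standing for $Q_0$ to $Q_4$,
the transition table is that of the five-state DFA discussed next in
the text, and all names (\texttt{pairs}, \texttt{norm}, \texttt{mark}
and so on) are made up for this sketch.

{\small\begin{lstlisting}[language=Scala]
val states   = List(0, 1, 2, 3, 4)
val alphabet = List('a', 'b')
val fins     = Set(4)
val delta: Map[(Int, Char), Int] = Map(
  (0, 'a') -> 1, (0, 'b') -> 2,
  (1, 'a') -> 4, (1, 'b') -> 2,
  (2, 'a') -> 3, (2, 'b') -> 2,
  (3, 'a') -> 4, (3, 'b') -> 0,
  (4, 'a') -> 4, (4, 'b') -> 4)

// Step 1: all pairs (q, p) with q < p, so that each
// unordered pair of different states occurs exactly once
val pairs = for (q <- states; p <- states if q < p) yield (q, p)

// pairs are always stored with the smaller state first
def norm(q: Int, p: Int): (Int, Int) =
  if (q < p) (q, p) else (p, q)

// Step 2: pairs of an accepting and a non-accepting state
val init: Set[(Int, Int)] =
  pairs.filter { case (q, p) => fins(q) != fins(p) }.toSet

// Steps 3 and 4: mark more pairs until nothing changes
def mark(marked: Set[(Int, Int)]): Set[(Int, Int)] = {
  val more = pairs.filter { case (q, p) =>
    alphabet.exists(c =>
      marked(norm(delta((q, c)), delta((p, c))))) }
  val next = marked ++ more
  if (next == marked) marked else mark(next)
}

// Step 5: the unmarked pairs can be merged
val marked    = mark(init)
val mergeable = pairs.filterNot(marked)
\end{lstlisting}}

\noindent
For this transition table \texttt{mergeable} contains the two pairs
\texttt{(0, 2)} and \texttt{(1, 3)}.
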
\noindent Unfortunately, once we throw away all unreachable states in
\eqref{subsetdfa}, all remaining states are needed. In order to
illustrate the minimisation algorithm, consider the following DFA.

\begin{center}
\begin{tikzpicture}[>=stealth',very thick,auto,
                    every state/.style={minimum size=0pt,
                    inner sep=2pt,draw=blue!50,very thick,
                    fill=blue!20}]
\node[state,initial]  (Q_0)  {$Q_0$};
\node[state] (Q_1) [right=of Q_0] {$Q_1$};
\node[state] (Q_2) [below right=of Q_0] {$Q_2$};
\node[state] (Q_3) [right=of Q_2] {$Q_3$};
\node[state, accepting] (Q_4) [right=of Q_1] {$Q_4$};
\path[->] (Q_0) edge node [above] {$a$} (Q_1);
\path[->] (Q_1) edge node [above] {$a$} (Q_4);
\path[->] (Q_4) edge [loop right] node {$a, b$} ();
\path[->] (Q_3) edge node [right] {$a$} (Q_4);
\path[->] (Q_2) edge node [above] {$a$} (Q_3);
\path[->] (Q_1) edge node [right] {$b$} (Q_2);
\path[->] (Q_0) edge node [above] {$b$} (Q_2);
\path[->] (Q_2) edge [loop left] node {$b$} ();
\path[->] (Q_3) edge [bend left=95, looseness=1.3] node
  [below] {$b$} (Q_0);
\end{tikzpicture}
\end{center}

\noindent In Steps 1 and 2 we consider essentially a triangle
of the form

\begin{center}
\begin{tikzpicture}[scale=0.6,line width=0.8mm]
\draw (0,0) -- (4,0);
\draw (0,1) -- (4,1);
\draw (0,2) -- (3,2);
\draw (0,3) -- (2,3);
\draw (0,4) -- (1,4);

\draw (0,0) -- (0, 4);
\draw (1,0) -- (1, 4);
\draw (2,0) -- (2, 3);
\draw (3,0) -- (3, 2);
\draw (4,0) -- (4, 1);

\draw (0.5,-0.5) node {$Q_0$};
\draw (1.5,-0.5) node {$Q_1$};
\draw (2.5,-0.5) node {$Q_2$};
\draw (3.5,-0.5) node {$Q_3$};

\draw (-0.5, 3.5) node {$Q_1$};
\draw (-0.5, 2.5) node {$Q_2$};
\draw (-0.5, 1.5) node {$Q_3$};
\draw (-0.5, 0.5) node {$Q_4$};

\draw (0.5,0.5) node {\large$\star$};
\draw (1.5,0.5) node {\large$\star$};
\draw (2.5,0.5) node {\large$\star$};
\draw (3.5,0.5) node {\large$\star$};
\end{tikzpicture}
\end{center}

4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1201	\noindent where the lower row is filled with stars, because in
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1202	the corresponding pairs there is always one state that is
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1203	accepting ($Q_4$) and a state that is non-accepting (the other
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1204	states).
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1205
491 d5776c6018f0 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	1206	In Step 3 we need to fill in more stars according to whether
490 4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1207	one of the next-state pairs is marked. We have to do this
753 d94fdbef1a4f updated Christian Urban <christian.urban@kcl.ac.uk> parents: 698 diff changeset	1208	for every unmarked field until there is no change any more.
490 4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1209	This gives the triangle
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1210
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1211	\begin{center}
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1212	\begin{tikzpicture}[scale=0.6,line width=0.8mm]
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1213	\draw (0,0) -- (4,0);
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1214	\draw (0,1) -- (4,1);
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1215	\draw (0,2) -- (3,2);
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1216	\draw (0,3) -- (2,3);
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1217	\draw (0,4) -- (1,4);
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1218
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1219	\draw (0,0) -- (0, 4);
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1220	\draw (1,0) -- (1, 4);
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1221	\draw (2,0) -- (2, 3);
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1222	\draw (3,0) -- (3, 2);
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1223	\draw (4,0) -- (4, 1);
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1224
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1225	\draw (0.5,-0.5) node {$Q_0$};
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1226	\draw (1.5,-0.5) node {$Q_1$};
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1227	\draw (2.5,-0.5) node {$Q_2$};
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1228	\draw (3.5,-0.5) node {$Q_3$};
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1229
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1230	\draw (-0.5, 3.5) node {$Q_1$};
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1231	\draw (-0.5, 2.5) node {$Q_2$};
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1232	\draw (-0.5, 1.5) node {$Q_3$};
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1233	\draw (-0.5, 0.5) node {$Q_4$};
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1234
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1235	\draw (0.5,0.5) node {\large$\star$};
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1236	\draw (1.5,0.5) node {\large$\star$};
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1237	\draw (2.5,0.5) node {\large$\star$};
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1238	\draw (3.5,0.5) node {\large$\star$};
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1239	\draw (0.5,1.5) node {\large$\star$};
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1240	\draw (2.5,1.5) node {\large$\star$};
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1241	\draw (0.5,3.5) node {\large$\star$};
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1242	\draw (1.5,2.5) node {\large$\star$};
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1243	\end{tikzpicture}
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1244	\end{center}
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1245
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1246	\noindent which means states $Q_0$ and $Q_2$, as well as $Q_1$
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1247	and $Q_3$ can be merged. This gives the following minimal DFA
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1248
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1249	\begin{center}
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1250	\begin{tikzpicture}[>=stealth',very thick,auto,
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1251	every state/.style={minimum size=0pt,
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1252	inner sep=2pt,draw=blue!50,very thick,
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1253	fill=blue!20}]
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1254	\node[state,initial] (Q_02) {$Q_{0, 2}$};
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1255	\node[state] (Q_13) [right=of Q_02] {$Q_{1, 3}$};
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1256	\node[state, accepting] (Q_4) [right=of Q_13]
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1257	{$Q_{4\phantom{,0}}$};
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1258	\path[->] (Q_02) edge [bend left] node [above] {$a$} (Q_13);
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1259	\path[->] (Q_13) edge [bend left] node [below] {$b$} (Q_02);
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1260	\path[->] (Q_02) edge [loop below] node {$b$} ();
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1261	\path[->] (Q_13) edge node [above] {$a$} (Q_4);
4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1262	\path[->] (Q_4) edge [loop above] node {$a, b$} ();
344 408fd5994288 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 333 diff changeset	1263	\end{tikzpicture}
408fd5994288 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 333 diff changeset	1264	\end{center}
408fd5994288 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 333 diff changeset	1265
408fd5994288 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 333 diff changeset	1266
753 d94fdbef1a4f updated Christian Urban <christian.urban@kcl.ac.uk> parents: 698 diff changeset	1267	\noindent This minimised DFA is certainly fast when it comes to deciding
d94fdbef1a4f updated Christian Urban <christian.urban@kcl.ac.uk> parents: 698 diff changeset	1268	whether a string is accepted or not. But this is not universally the
d94fdbef1a4f updated Christian Urban <christian.urban@kcl.ac.uk> parents: 698 diff changeset	1269	case. Suppose you count the nodes in a regular expression (when
d94fdbef1a4f updated Christian Urban <christian.urban@kcl.ac.uk> parents: 698 diff changeset	1270	represented as a tree). If you look carefully at the Thompson
d94fdbef1a4f updated Christian Urban <christian.urban@kcl.ac.uk> parents: 698 diff changeset	1271	Construction you can see that the constructed NFA has states that grow
d94fdbef1a4f updated Christian Urban <christian.urban@kcl.ac.uk> parents: 698 diff changeset	1272	linearly in terms of the size of the regular expression. This is good,
d94fdbef1a4f updated Christian Urban <christian.urban@kcl.ac.uk> parents: 698 diff changeset	1273	but as we have seen earlier deciding whether a string is matched by an
d94fdbef1a4f updated Christian Urban <christian.urban@kcl.ac.uk> parents: 698 diff changeset	1274	NFA is hard. Translating an NFA into a DFA means deciding whether a
d94fdbef1a4f updated Christian Urban <christian.urban@kcl.ac.uk> parents: 698 diff changeset	1275	string is matched by a DFA is easy, but the number of states can grow
d94fdbef1a4f updated Christian Urban <christian.urban@kcl.ac.uk> parents: 698 diff changeset	1276	exponentially, even after minimisation. Say an NFA has $n$ states, then
d94fdbef1a4f updated Christian Urban <christian.urban@kcl.ac.uk> parents: 698 diff changeset	1277	in the worst case the corresponding minimal DFA that can match the
d94fdbef1a4f updated Christian Urban <christian.urban@kcl.ac.uk> parents: 698 diff changeset	1278	same language as the NFA might contain $2^n$ states. Unfortunately
d94fdbef1a4f updated Christian Urban <christian.urban@kcl.ac.uk> parents: 698 diff changeset	1279	in many interesting cases this worst case bound is the dominant
d94fdbef1a4f updated Christian Urban <christian.urban@kcl.ac.uk> parents: 698 diff changeset	1280	factor.
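A standard example (not taken from this handout, so treat the details
as a sketch): fix a number $k$ and consider the strings over $a$ and
$b$ whose $k$-th character from the end is an $a$. For $k = 3$ this is
the language of $(a + b)^*\,a\,(a + b)\,(a + b)$. An NFA with $k + 1$
states suffices, because it can guess which $a$ is the $k$-th from the
end. A DFA, however, has to remember the last $k$ characters it has
read, and no DFA with fewer than $2^k$ states can recognise this
language.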
d94fdbef1a4f updated Christian Urban <christian.urban@kcl.ac.uk> parents: 698 diff changeset	1281
d94fdbef1a4f updated Christian Urban <christian.urban@kcl.ac.uk> parents: 698 diff changeset	1282
492 39b7ff2cf1bc updated Christian Urban <urbanc@in.tum.de> parents: 491 diff changeset	1283	By the way, we are not bothering with implementing the above
667 412556272333 updated Christian Urban <urbanc@in.tum.de> parents: 662 diff changeset	1284	minimisation algorithm: while up to now all the transformations used
492 39b7ff2cf1bc updated Christian Urban <urbanc@in.tum.de> parents: 491 diff changeset	1285	some clever composition of functions, the minimisation algorithm
39b7ff2cf1bc updated Christian Urban <urbanc@in.tum.de> parents: 491 diff changeset	1286	cannot be implemented by just composing some functions. For this we
39b7ff2cf1bc updated Christian Urban <urbanc@in.tum.de> parents: 491 diff changeset	1287	would require a more concrete representation of the transition
39b7ff2cf1bc updated Christian Urban <urbanc@in.tum.de> parents: 491 diff changeset	1288	function (like maps). If we did this, however, then many advantages of
39b7ff2cf1bc updated Christian Urban <urbanc@in.tum.de> parents: 491 diff changeset	1289	the functions would be thrown away. So the compromise is not being
753 d94fdbef1a4f updated Christian Urban <christian.urban@kcl.ac.uk> parents: 698 diff changeset	1290	able to minimise (easily) our DFAs. We want to use regular expressions
d94fdbef1a4f updated Christian Urban <christian.urban@kcl.ac.uk> parents: 698 diff changeset	1291	directly anyway.
492 39b7ff2cf1bc updated Christian Urban <urbanc@in.tum.de> parents: 491 diff changeset	1292
490 4fee50f38305 updated Christian Urban <urbanc@in.tum.de> parents: 489 diff changeset	1293	\subsection*{Brzozowski's Method}
269 83e6cb90216d updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 268 diff changeset	1294
495 7d9d86dc7aa0 updated Christian Urban <urbanc@in.tum.de> parents: 492 diff changeset	1295	I know this handout is already a long, long rant: but after all it is
7d9d86dc7aa0 updated Christian Urban <urbanc@in.tum.de> parents: 492 diff changeset	1296	a topic that has been researched for more than 60 years. If you
7d9d86dc7aa0 updated Christian Urban <urbanc@in.tum.de> parents: 492 diff changeset	1297	reflect on what you have read so far, the story is that you can take a
7d9d86dc7aa0 updated Christian Urban <urbanc@in.tum.de> parents: 492 diff changeset	1298	regular expression, translate it via the Thompson Construction into an
491 d5776c6018f0 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	1299	$\epsilon$NFA, then translate it into an NFA by removing all
d5776c6018f0 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	1300	$\epsilon$-transitions, and then via the subset construction obtain a
d5776c6018f0 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	1301	DFA. In all steps we made sure the language, or which strings can be
698 7c7854feccb5 updated Christian Urban <urbanc@in.tum.de> parents: 667 diff changeset	1302	recognised, stays the same. Of course we should have proved this in
495 7d9d86dc7aa0 updated Christian Urban <urbanc@in.tum.de> parents: 492 diff changeset	1303	each step, but let us cut corners here. After the last section, we
7d9d86dc7aa0 updated Christian Urban <urbanc@in.tum.de> parents: 492 diff changeset	1304	can even minimise the DFA (maybe not in code). But again we made sure
7d9d86dc7aa0 updated Christian Urban <urbanc@in.tum.de> parents: 492 diff changeset	1305	the same language is recognised. You might be wondering: Can we go
7d9d86dc7aa0 updated Christian Urban <urbanc@in.tum.de> parents: 492 diff changeset	1306	in the other direction? Can we go from a DFA and obtain a regular
7d9d86dc7aa0 updated Christian Urban <urbanc@in.tum.de> parents: 492 diff changeset	1307	expression that can recognise the same language as the DFA?\medskip
491 d5776c6018f0 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	1308
d5776c6018f0 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	1309	\noindent
d5776c6018f0 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	1310	The answer is yes. Again there are several methods for calculating a
d5776c6018f0 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	1311	regular expression for a DFA. I will show you Brzozowski's method
d5776c6018f0 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	1312	because it calculates a regular expression using quite familiar
d5776c6018f0 updated Christian Urban <urbanc@in.tum.de> parents: 490 diff changeset	1313	transformations for solving equational systems. Consider the DFA:
292 7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1314
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1315	\begin{center}
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1316	\begin{tikzpicture}[scale=1.5,>=stealth',very thick,auto,
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1317	every state/.style={minimum size=0pt,
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1318	inner sep=2pt,draw=blue!50,very thick,
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1319	fill=blue!20}]
482 0f6e3c5a1751 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	1320	\node[state, initial] (q0) at ( 0,1) {$Q_0$};
0f6e3c5a1751 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	1321	\node[state] (q1) at ( 1,1) {$Q_1$};
0f6e3c5a1751 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	1322	\node[state, accepting] (q2) at ( 2,1) {$Q_2$};
292 7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1323	\path[->] (q0) edge[bend left] node[above] {$a$} (q1)
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1324	(q1) edge[bend left] node[above] {$b$} (q0)
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1325	(q2) edge[bend left=50] node[below] {$b$} (q0)
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1326	(q1) edge node[above] {$a$} (q2)
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1327	(q2) edge [loop right] node {$a$} ()
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1328	(q0) edge [loop below] node {$b$} ();
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1329	\end{tikzpicture}
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1330	\end{center}
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1331
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1332	\noindent for which we can set up the following equational
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1333	system
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1334
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1335	\begin{eqnarray}
482 0f6e3c5a1751 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	1336	Q_0 & = & \ONE + Q_0\,b + Q_1\,b + Q_2\,b\\
0f6e3c5a1751 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	1337	Q_1 & = & Q_0\,a\\
0f6e3c5a1751 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	1338	Q_2 & = & Q_1\,a + Q_2\,a
292 7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1339	\end{eqnarray}
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1340
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1341	\noindent There is an equation for each node in the DFA. Let
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1342	us have a look at how the right-hand sides of the equations are
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1343	constructed. First have a look at the second equation: the
482 0f6e3c5a1751 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	1344	left-hand side is $Q_1$ and the right-hand side $Q_0\,a$. The
292 7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1345	right-hand side is essentially all possible ways to end up
482 0f6e3c5a1751 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	1346	in node $Q_1$. There is only one incoming edge from $Q_0$ consuming
322 698ed1c96cd0 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 318 diff changeset	1347	an $a$. Therefore the right-hand side is this
482 0f6e3c5a1751 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	1348	state followed by the character, in this case $Q_0\,a$. Now let us
292 7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1349	have a look at the third equation: there are two incoming
482 0f6e3c5a1751 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	1350	edges for $Q_2$. Therefore we have two terms, namely $Q_1\,a$ and
0f6e3c5a1751 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	1351	$Q_2\,a$. These terms are separated by $+$. The first states
0f6e3c5a1751 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	1352	that being in state $Q_1$ and consuming an $a$ will bring you to
485 bd6d999ab7b6 updated Christian Urban <urbanc@in.tum.de> parents: 484 diff changeset	1353	$Q_2$, and the second that being in $Q_2$ and consuming an $a$
482 0f6e3c5a1751 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	1354	will make you stay in $Q_2$. The right-hand side of the
292 7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1355	first equation is constructed similarly: there are three
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1356	incoming edges, therefore there are three terms. There is
444 3056a4c071b0 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 349 diff changeset	1357	one exception in that we also ``add'' $\ONE$ to the
292 7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1358	first equation, because it corresponds to the starting state
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1359	in the DFA.
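Read this way, each variable $Q_j$ stands for the strings that take
the DFA from the starting state to the node $Q_j$. Schematically, and
only as a summary of the construction just described (the notation
$Q_i \stackrel{c}{\longrightarrow} Q_j$ for an edge labelled $c$ is
introduced here just for this purpose), the equation for a node $Q_j$
has the shape
\[
Q_j \;=\; \sum_{Q_i \stackrel{c}{\longrightarrow} Q_j} Q_i\,c
\]
where the sum stands for repeated $+$, and where the extra summand
$\ONE$ is added only if $Q_j$ is the starting state.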
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1360
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1361	Having constructed the equational system, the question is
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1362	how to solve it? Remarkably the rules are very similar to
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1363	solving usual linear equational systems. For example the
482 0f6e3c5a1751 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	1364	second equation does not contain the variable $Q_1$ on the
292 7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1365	right-hand side of the equation. We can therefore eliminate
482 0f6e3c5a1751 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	1366	$Q_1$ from the system by just substituting this equation
292 7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1367	into the other two. This gives
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1368
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1369	\begin{eqnarray}
482 0f6e3c5a1751 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	1370	Q_0 & = & \ONE + Q_0\,b + Q_0\,a\,b + Q_2\,b\\
0f6e3c5a1751 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	1371	Q_2 & = & Q_0\,a\,a + Q_2\,a
292 7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1372	\end{eqnarray}
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1373
698 7c7854feccb5 updated Christian Urban <urbanc@in.tum.de> parents: 667 diff changeset	1374	\noindent where in Equation (6) we have two occurrences
482 0f6e3c5a1751 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	1375	of $Q_0$. Using the usual laws about $+$ and $\cdot$, we can simplify
698 7c7854feccb5 updated Christian Urban <urbanc@in.tum.de> parents: 667 diff changeset	1376	Equation (6) to obtain the following two equations:
292 7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1377
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1378	\begin{eqnarray}
482 0f6e3c5a1751 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	1379	Q_0 & = & \ONE + Q_0\,(b + a\,b) + Q_2\,b\\
0f6e3c5a1751 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	1380	Q_2 & = & Q_0\,a\,a + Q_2\,a
292 7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1381	\end{eqnarray}
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1382
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1383	\noindent Unfortunately we cannot make any more progress with
578 6e5e3adc9eb1 updated Christian Urban <urbanc@in.tum.de> parents: 573 diff changeset	1384	substituting equations, because both (8) and (9) contain the
292 7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1385	variable on the left-hand side also on the right-hand side.
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1386	Here we now need to use a law that is different from the usual
349 434891622131 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 344 diff changeset	1387	laws about linear equations. It is called \emph{Arden's rule}.
434891622131 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 344 diff changeset	1388	It states that if an equation is of the form $q = q\,r + s$
434891622131 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 344 diff changeset	1389	then it can be transformed to $q = s\, r^*$. Since we can
578 6e5e3adc9eb1 updated Christian Urban <urbanc@in.tum.de> parents: 573 diff changeset	1390	assume $+$ is symmetric, Equation (9) is of that form: $s$ is
482 0f6e3c5a1751 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	1391	$Q_0\,a\,a$ and $r$ is $a$. That means we can transform
578 6e5e3adc9eb1 updated Christian Urban <urbanc@in.tum.de> parents: 573 diff changeset	1392	(9) to obtain the two new equations
292 7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1393
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1394	\begin{eqnarray}
482 0f6e3c5a1751 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	1395	Q_0 & = & \ONE + Q_0\,(b + a\,b) + Q_2\,b\\
0f6e3c5a1751 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	1396	Q_2 & = & Q_0\,a\,a\,(a^*)
292 7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1397	\end{eqnarray}
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1398
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1399	\noindent Now again we can substitute the second equation into
482 0f6e3c5a1751 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	1400	the first in order to eliminate the variable $Q_2$.
292 7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1401
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1402	\begin{eqnarray}
482 0f6e3c5a1751 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	1403	Q_0 & = & \ONE + Q_0\,(b + a\,b) + Q_0\,a\,a\,(a^*)\,b
292 7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1404	\end{eqnarray}
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1405
482 0f6e3c5a1751 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	1406	\noindent Pulling $Q_0$ out as a single factor gives:
292 7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1407
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1408	\begin{eqnarray}
482 0f6e3c5a1751 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	1409	Q_0 & = & \ONE + Q_0\,(b + a\,b + a\,a\,(a^*)\,b)
292 7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1410	\end{eqnarray}
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1411
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1412	\noindent This equation is again of a form where we can
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1413	apply Arden's rule ($r$ is $b + a\,b + a\,a\,(a^*)\,b$ and $s$
482 0f6e3c5a1751 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	1414	is $\ONE$). This gives as solution for $Q_0$ the following
292 7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1415	regular expression:
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1416
7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1417	\begin{eqnarray}
482 0f6e3c5a1751 updated Christian Urban <urbanc@in.tum.de> parents: 480 diff changeset	1418	Q_0 & = & \ONE\,(b + a\,b + a\,a\,(a^)\,b)^
292 7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 270 diff changeset	1419	\end{eqnarray}

\noindent Since this is a regular expression, we can simplify
away the $\ONE$ to obtain the slightly simpler regular
expression

\begin{eqnarray}
Q_0 & = & (b + a\,b + a\,a\,(a^*)\,b)^*
\end{eqnarray}
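
\noindent As a quick sanity check (not a proof), we can plug this
solution back into the equation it came from. Writing $r$ for
$b + a\,b + a\,a\,(a^*)\,b$ and using the standard identity
$r^* \equiv \ONE + r^*\,r$, we get

\[
\ONE + Q_0\,r \;=\; \ONE + r^*\,r \;\equiv\; r^* \;=\; Q_0
\]

\noindent so $r^*$ does satisfy $Q_0 = \ONE + Q_0\,r$.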

\noindent
Now we can unwind this process and obtain the solutions
for the other equations. This gives:

\begin{eqnarray}
Q_0 & = & (b + a\,b + a\,a\,(a^*)\,b)^*\\
Q_1 & = & (b + a\,b + a\,a\,(a^*)\,b)^*\,a\\
Q_2 & = & (b + a\,b + a\,a\,(a^*)\,b)^*\,a\,a\,(a)^*
\end{eqnarray}

\noindent Finally, we only need to ``add'' up the equations
which correspond to a terminal state. In our running example,
this is just $Q_2$. Consequently, a regular expression
that recognises the same language as the DFA is

\[
(b + a\,b + a\,a\,(a^*)\,b)^*\,a\,a\,(a)^*
\]

\noindent You can cross-check your solution to some extent by taking a
string the regular expression can match and seeing whether it is also
accepted by the DFA. One such string is $aaa$, and \emph{voil\`a}, this
string is also accepted by the automaton.
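
\noindent Such a cross-check can also be automated for all short strings.
The following is only a sketch in Python, under two assumptions that are
not spelled out in this part of the text: the DFA is the one that can be
read off from the equations (every state goes to $Q_0$ on $b$; on $a$ we
have $Q_0 \to Q_1$, $Q_1 \to Q_2$ and $Q_2 \to Q_2$; $Q_0$ is the starting
state and $Q_2$ the accepting state), and Python's \texttt{re} module is
used merely as a convenient matcher, with $+$ written as \texttt{|}. The
names \texttt{DELTA}, \texttt{dfa\_accepts}, \texttt{regex\_accepts} and
\texttt{disagreements} are made up for this sketch.

```python
import re
from itertools import product

# Assumed DFA, read off from the equations (hypothetical encoding):
# states 0, 1, 2 stand for Q_0, Q_1, Q_2; Q_0 is the start, Q_2 accepts.
DELTA = {(0, "a"): 1, (1, "a"): 2, (2, "a"): 2,
         (0, "b"): 0, (1, "b"): 0, (2, "b"): 0}
START, ACCEPTING = 0, {2}

def dfa_accepts(s):
    """Run the DFA on s and report whether it ends in an accepting state."""
    q = START
    for c in s:
        q = DELTA[(q, c)]
    return q in ACCEPTING

# (b + ab + aa(a*)b)* aa(a)*  written in Python's regex syntax
REGEX = re.compile(r"(b|ab|aaa*b)*aaa*")

def regex_accepts(s):
    """Does the regular expression match the whole string s?"""
    return REGEX.fullmatch(s) is not None

def disagreements(max_len):
    """All strings over {a, b} up to max_len on which DFA and regex differ."""
    words = ("".join(w)
             for n in range(max_len + 1)
             for w in product("ab", repeat=n))
    return [w for w in words if dfa_accepts(w) != regex_accepts(w)]

print(disagreements(10))   # an empty list means no counterexample was found
```

\noindent An empty list is of course no proof of equivalence; it only
says that no counterexample exists among the strings that were tried.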

We should prove that Brzozowski's method really produces an equivalent
regular expression. But for the purposes of this module, we omit
this. I guess you are relieved.


\subsection*{Regular Languages}

Given the constructions in the previous sections we obtain
the following overall picture:

\begin{center}
\begin{tikzpicture}
\node (rexp) {\bf Regexps};
\node (nfa) [right=of rexp] {\bf NFAs};
\node (dfa) [right=of nfa] {\bf DFAs};
\node (mdfa) [right=of dfa] {\bf\begin{tabular}{c}minimal\\ DFAs\end{tabular}};
\path[->,line width=1mm] (rexp) edge node [above=4mm, black] {\begin{tabular}{c@{\hspace{9mm}}}Thompson's\\[-1mm] construction\end{tabular}} (nfa);
\path[->,line width=1mm] (nfa) edge node [above=4mm, black] {\begin{tabular}{c}subset\\[-1mm] construction\end{tabular}}(dfa);
\path[->,line width=1mm] (dfa) edge node [below=5mm, black] {minimisation} (mdfa);
\path[->,line width=1mm] (dfa) edge [bend left=45] node [below] {\begin{tabular}{l}Brzozowski's\\ method\end{tabular}} (rexp);
\end{tikzpicture}
\end{center}

\noindent By going from regular expressions via NFAs to DFAs,
we can always ensure that for every regular expression there
exists an NFA and a DFA that recognise the same language,
although we did not prove this fact. Similarly, by going from
DFAs to regular expressions, we can make sure that for every DFA
there exists a regular expression that recognises the same
language. Again, we did not prove this fact.

The fundamental conclusion we can draw is that automata and regular
expressions can recognise the same set of languages:

\begin{quote} A language is \emph{regular} iff there exists a
regular expression that recognises exactly its strings.
\end{quote}

\noindent or equivalently

\begin{quote} A language is \emph{regular} iff there exists an
automaton that recognises exactly its strings.
\end{quote}

\noindent Note that this is not a statement about a particular language
(that is, a particular set of strings), but about a large class of
languages, namely the regular ones.

As a consequence, for deciding whether a string is recognised by a
regular expression, we could use our algorithm based on derivatives, or
NFAs, or DFAs. But let us quickly look at what the differences mean in
computational terms. Translating a regular expression into an NFA gives
us an automaton that has $O(n)$ states, which means that the size of the
NFA grows linearly with the size of the regular expression. The
trouble with NFAs is that deciding whether a string is
accepted or not is computationally not cheap. Remember that with NFAs we
have potentially many next states even for the same input and also
have the silent $\epsilon$-transitions. If we want to find a path from
the starting state of an NFA to an accepting state, we need to consider
all possibilities. In Ruby, Python and Java this is done by a
depth-first search, which in turn means that if a ``wrong'' choice is
made, the algorithm has to backtrack and thus explore all potential
candidates. This is exactly the reason why Ruby, Python and Java are
so slow for evil regular expressions. An alternative to the
potentially slow depth-first search is to explore the search space in
a breadth-first fashion, but this might incur a big memory penalty.
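
\noindent To make the breadth-first idea concrete, here is a minimal sketch
in Python. It is not the module's implementation: the example NFA (strings
over $\{a, b\}$ ending in $aa$), the absence of $\epsilon$-transitions and
the names \texttt{NFA\_DELTA}, \texttt{nexts} and \texttt{nfa\_accepts}
are all assumptions made for illustration. Instead of following one path
and backtracking, the simulation carries along the whole set of states the
NFA could currently be in.

```python
# Hypothetical NFA (not from the handout): accepts strings ending in "aa".
# NFA_DELTA maps (state, character) to the SET of possible next states.
NFA_DELTA = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "a"): {2}}
NFA_STARTS, NFA_ACCEPTING = {0}, {2}

def nexts(states, c):
    """All states reachable from any state in `states` by reading c."""
    return {q2 for q in states for q2 in NFA_DELTA.get((q, c), set())}

def nfa_accepts(s):
    """Breadth-first simulation: advance the whole set of states at once."""
    current = set(NFA_STARTS)
    for c in s:
        current = nexts(current, c)
    # accept if at least one of the possible states is accepting
    return bool(current & NFA_ACCEPTING)

print(nfa_accepts("baa"), nfa_accepts("aab"))   # prints: True False
```

\noindent No choice is ever undone here, so there is no backtracking; the
price is that a whole set of states has to be kept around and updated at
every step.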

To avoid the problems with NFAs, we can translate them into DFAs. With
DFAs the problem of deciding whether a string is recognised or not is
much simpler, because in each state it is completely determined what
the next state will be for a given input. So no search is needed. The
problem with this is that the translation to DFAs can blow up
the number of states exponentially. Therefore, when this route is
taken, we definitely need to minimise the resulting DFAs in order to
have acceptable memory and runtime behaviour. But remember that the
subset construction can, in the worst case, turn an NFA with $n$ states
into a DFA with $2^n$ states. Effectively, the translation to DFAs can
therefore also incur a big runtime penalty.
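
\noindent The blow-up is easy to observe experimentally. The sketch below
(Python, with made-up names \texttt{kth\_last\_nfa} and
\texttt{subset\_construction}; it only counts reachable DFA states and
ignores $\epsilon$-transitions) uses a family of NFAs that is commonly
used to illustrate the exponential growth: strings over $\{a, b\}$ whose
$k$-th last character is an $a$. This family is an assumption of the
example and does not appear elsewhere in this handout.

```python
def kth_last_nfa(k):
    """Hypothetical NFA with k + 1 states: the k-th last character is 'a'."""
    delta = {(0, "a"): {0, 1}, (0, "b"): {0}}
    for i in range(1, k):
        delta[(i, "a")] = {i + 1}
        delta[(i, "b")] = {i + 1}
    return delta, frozenset({0}), {k}

def subset_construction(delta, start, alphabet="ab"):
    """Return all DFA states (sets of NFA states) reachable from start."""
    seen, todo = {start}, [start]
    while todo:
        qs = todo.pop()
        for c in alphabet:
            nxt = frozenset(q2 for q in qs
                            for q2 in delta.get((q, c), set()))
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return seen

for k in range(1, 9):
    delta, start, _ = kth_last_nfa(k)
    n_dfa = len(subset_construction(delta, start))
    print(k + 1, "NFA states ->", n_dfa, "DFA states")
```

\noindent For this family the number of reachable DFA states doubles with
every state added to the NFA, so minimisation after the fact cannot undo
the cost of having built the large DFA in the first place.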

But this does not mean that everything is bad with automata. Recall
the problem of finding a regular expression for the language that is
\emph{not} recognised by a given regular expression. In our implementation
we explicitly added such a regular expression because it is useful
for recognising comments. But in principle we did not need to. The
argument for this is as follows: take a regular expression, translate
it into an NFA and then a DFA that both recognise the same
language. Once you have the DFA, it is very easy to construct the
automaton for the language not recognised by this DFA. If the DFA is
completed (this is important!), then you just need to exchange the
accepting and non-accepting states. You can then translate this DFA
back into a regular expression and that will be the regular expression
that can match all strings the original regular expression could
\emph{not} match.
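
\noindent The completion-then-swap step can be sketched as follows. This is
an illustration in Python rather than the module's code; the
representation of a DFA as a dictionary, the extra state called
\texttt{"sink"} and the function names \texttt{complete},
\texttt{complement} and \texttt{accepts} are all assumptions of the
sketch.

```python
def complete(states, alphabet, delta, sink="sink"):
    """Add a sink state so that delta is defined for every (state, char)."""
    states = set(states) | {sink}
    total = dict(delta)
    for q in states:
        for c in alphabet:
            total.setdefault((q, c), sink)
    return states, total

def complement(states, alphabet, delta, start, accepting):
    """Complete the DFA, then swap accepting and non-accepting states."""
    states, total = complete(states, alphabet, delta)
    return states, total, start, states - set(accepting)

def accepts(delta, start, accepting, s):
    """Run a completed DFA on the string s."""
    q = start
    for c in s:
        q = delta[(q, c)]
    return q in accepting

# Hypothetical partial DFA over {a, b} that accepts exactly the string "ab".
STATES, ALPHABET = {0, 1, 2}, "ab"
DELTA = {(0, "a"): 1, (1, "b"): 2}

_, c_delta, c_start, c_acc = complement(STATES, ALPHABET, DELTA, 0, {2})
print(accepts(c_delta, c_start, c_acc, "ab"),
      accepts(c_delta, c_start, c_acc, "abb"))   # prints: False True
```

\noindent The string $abb$ shows why completion matters: in the original
partial DFA the run simply gets stuck after $ab$, so swapping the accepting
states alone would not make the automaton accept it. Only the added sink
state, which becomes accepting after the swap, does.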

It is also interesting that not all languages are regular. The most
well-known example of a language that is not regular consists of all
the strings of the form

\[a^n\,b^n\]

\noindent meaning strings that consist of some number of $a$s followed
by the same number of $b$s. You can try, but you cannot find a regular
expression for this language and also not an automaton. One can
actually prove that there is neither a regular expression nor an
automaton for this language, but again that would lead us too far
afield for what we want to do in this module.


\subsection*{Where Have Derivatives Gone?}

%%Still to be done\bigskip

\noindent
By now you are probably fed up with this text. It is now way too long
for one lecture, but there is still one aspect of the
automata-regular-expression-connection I would like to describe:\medskip

\noindent
Where have the derivatives gone? Did we just forget them? Well, they
actually do play a role in generating a DFA from a regular expression.
And we can also see this in our implementation\ldots{}because there is
one flaw in our representation of automata and transitions as partial
functions\ldots{}remember I said something about fishy things.
Namely, we can represent automata with infinitely many states, which is
actually forbidden by the definition of what an automaton is. We can
exploit this flaw as follows: Suppose our alphabet consists of the
characters $c_1$ to $c_n$. Then we can generate an ``automaton''
(it is not really one because it has infinitely many states) by taking
as starting state the regular expression $r$ for which we
want to generate an automaton. There are $n$ next-states, which
correspond to the derivatives of $r$ according to $c_1$ to $c_n$.
Implementing this in our slightly ``flawed'' representation is
not too difficult. This will give a picture for the ``automaton''
looking something like this, except that it extends infinitely
far to the right:


\begin{center}
\begin{tikzpicture}
[level distance=25mm,very thick,auto,
level 1/.style={sibling distance=30mm},
level 2/.style={sibling distance=15mm},
every node/.style={minimum size=30pt,
inner sep=0pt,circle,draw=blue!50,very thick,
fill=blue!20}]
\node {$r$} [grow=right]
child[->] {node (cn) {$d_{c_n}$}
child { node {$dd_{c_nc_n}$}}
child { node {$dd_{c_nc_1}$}}
%edge from parent node[left] {$c_n$}
}
child[->] {node (c1) {$d_{c_1}$}
child { node {$dd_{c_1c_n}$}}
child { node {$dd_{c_1c_1}$}}
%edge from parent node[left] {$c_1$}
};
%%\draw (cn) -- (c1) node {\vdots};
\end{tikzpicture}
\end{center}

\noindent
I leave it to you to implement this ``automaton''.
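
\noindent If you want something to compare your own attempt against, here
is one possible sketch. It is written in Python only for brevity and is
not the module's code: the class names \texttt{Zero}, \texttt{One},
\texttt{Char}, \texttt{Alt}, \texttt{Seq}, \texttt{Star} and the functions
\texttt{nullable}, \texttt{der}, \texttt{delta} and \texttt{accepts} are
assumptions of this sketch. The states are regular expressions, the
transition function is the derivative, and a state is accepting if it is
nullable.

```python
from dataclasses import dataclass

class Rexp:
    """Base class for regular expressions (hypothetical representation)."""

@dataclass(frozen=True)
class Zero(Rexp):
    pass

@dataclass(frozen=True)
class One(Rexp):
    pass

@dataclass(frozen=True)
class Char(Rexp):
    c: str

@dataclass(frozen=True)
class Alt(Rexp):
    r1: Rexp
    r2: Rexp

@dataclass(frozen=True)
class Seq(Rexp):
    r1: Rexp
    r2: Rexp

@dataclass(frozen=True)
class Star(Rexp):
    r: Rexp

def nullable(r):
    """Can r match the empty string?"""
    if isinstance(r, (One, Star)):
        return True
    if isinstance(r, Alt):
        return nullable(r.r1) or nullable(r.r2)
    if isinstance(r, Seq):
        return nullable(r.r1) and nullable(r.r2)
    return False                      # Zero and Char

def der(c, r):
    """Derivative of r with respect to the character c."""
    if isinstance(r, Char):
        return One() if r.c == c else Zero()
    if isinstance(r, Alt):
        return Alt(der(c, r.r1), der(c, r.r2))
    if isinstance(r, Seq):
        left = Seq(der(c, r.r1), r.r2)
        return Alt(left, der(c, r.r2)) if nullable(r.r1) else left
    if isinstance(r, Star):
        return Seq(der(c, r.r), r)
    return Zero()                     # Zero and One

def delta(state, c):
    """Transition function of the "automaton": just the derivative."""
    return der(c, state)

def accepts(r, s):
    state = r                         # the starting state is r itself
    for c in s:
        state = delta(state, c)
    return nullable(state)            # accepting states are the nullable ones

# (a + b)* a a, that is strings over {a, b} ending in "aa"
R = Seq(Star(Alt(Char("a"), Char("b"))), Seq(Char("a"), Char("a")))
print(accepts(R, "baa"), accepts(R, "aab"))   # prints: True False
```

\noindent Nothing in this sketch bounds the number of distinct states
that \texttt{delta} can produce, which is exactly the ``flaw'' mentioned
above: the derivatives are not simplified, so they can keep growing.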

While this all makes sense (modulo the flaw with the infinitely many states),
does this automaton teach us anything new? The answer is no, because it
boils down to just another implementation of the Brzozowski
algorithm from Lecture 2. There \emph{is} however something interesting
in this construction, which Brzozowski already cleverly found out:
there is a way to restrict the number of states to something finite,
meaning it would give us a real automaton.
However, this would lead us far, far away from what we should
discuss here.



%\section*{Further Reading}

%Compare what a ``human expert'' would create as an automaton for the
%regular expression $a\cdot (b + c)^*$ and what the Thompson
%algorithm generates.

\end{document}

%%% Local Variables:
%%% mode: latex
%%% TeX-master: t
%%% End:



author	Christian Urban <christian.urban@kcl.ac.uk>
	Tue, 19 Sep 2023 12:56:10 +0100 (17 months ago)