\documentclass{article}+ −
\usepackage{charter}+ −
\usepackage{hyperref}+ −
\usepackage{amssymb}+ −
\usepackage{amsmath}+ −
\usepackage[T1]{fontenc}+ −
\usepackage{listings}+ −
\usepackage{xcolor}+ −
\usepackage{tikz}+ −
\usetikzlibrary{arrows}+ −
\usetikzlibrary{automata}+ −
\usetikzlibrary{shapes}+ −
\usetikzlibrary{shadows}+ −
\usetikzlibrary{positioning}+ −
\usetikzlibrary{calc}+ −
\usetikzlibrary{fit}+ −
\usetikzlibrary{backgrounds}+ −
+ −
\newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}%+ −
+ −
\definecolor{javared}{rgb}{0.6,0,0} % for strings+ −
\definecolor{javagreen}{rgb}{0.25,0.5,0.35} % comments+ −
\definecolor{javapurple}{rgb}{0.5,0,0.35} % keywords+ −
\definecolor{javadocblue}{rgb}{0.25,0.35,0.75} % javadoc+ −
+ −
\lstdefinelanguage{scala}{+ −
morekeywords={abstract,case,catch,class,def,%+ −
do,else,extends,false,final,finally,%+ −
for,if,implicit,import,match,mixin,%+ −
new,null,object,override,package,%+ −
private,protected,requires,return,sealed,%+ −
super,this,throw,trait,true,try,%+ −
type,val,var,while,with,yield},+ −
otherkeywords={=>,<-,<\%,<:,>:,\#,@},+ −
sensitive=true,+ −
morecomment=[l]{//},+ −
morecomment=[n]{/*}{*/},+ −
morestring=[b]",+ −
morestring=[b]',+ −
morestring=[b]"""+ −
}+ −
+ −
\lstset{language=Scala,+ −
basicstyle=\ttfamily,+ −
keywordstyle=\color{javapurple}\bfseries,+ −
stringstyle=\color{javagreen},+ −
commentstyle=\color{javagreen},+ −
morecomment=[s][\color{javadocblue}]{/**}{*/},+ −
numbers=left,+ −
numberstyle=\tiny\color{black},+ −
stepnumber=1,+ −
numbersep=10pt,+ −
tabsize=2,+ −
showspaces=false,+ −
showstringspaces=false}+ −
+ −
\begin{document}+ −
+ −
\section*{Handout 3}+ −
+ −
Let us have a closer look at automata and their relation to regular expressions. This will help us to understand+ −
why the regular expression matchers in Python and Ruby are so slow with certain regular expressions. + −
+ −
A \emph{deterministic finite automaton} (DFA), say $A$, is defined by a four-tuple written $A(Q, q_0, F, \delta)$ where+ −
+ −
\begin{itemize}+ −
\item $Q$ is a set of states,+ −
\item $q_0 \in Q$ is the start state,+ −
\item $F \subseteq Q$ are the accepting states, and+ −
\item $\delta$ is the transition function.+ −
\end{itemize}+ −
+ −
\noindent+ −
The transition function determines how to ``transition'' from one state to the next state with respect to a character.+ −
We have the assumption that these functions do not need to be defined everywhere: so it can be the case that+ −
given a character there is no next state, in which case we need to raise a kind of ``raise an exception''. A typical + −
example of a DFA is+ −
+ −
\begin{center}+ −
\begin{tikzpicture}[>=stealth',very thick,auto,+ −
every state/.style={minimum size=0pt,inner sep=2pt,draw=blue!50,very thick,fill=blue!20},]+ −
\node[state,initial] (q_0) {$q_0$};+ −
\node[state] (q_1) [right=of q_0] {$q_1$};+ −
\node[state] (q_2) [below right=of q_0] {$q_2$};+ −
\node[state] (q_3) [right=of q_2] {$q_3$};+ −
\node[state, accepting] (q_4) [right=of q_1] {$q_4$};+ −
\path[->] (q_0) edge node [above] {$a$} (q_1);+ −
\path[->] (q_1) edge node [above] {$a$} (q_4);+ −
\path[->] (q_4) edge [loop right] node {$a, b$} ();+ −
\path[->] (q_3) edge node [right] {$a$} (q_4);+ −
\path[->] (q_2) edge node [above] {$a$} (q_3);+ −
\path[->] (q_1) edge node [right] {$b$} (q_2);+ −
\path[->] (q_0) edge node [above] {$b$} (q_2);+ −
\path[->] (q_2) edge [loop left] node {$b$} ();+ −
\path[->] (q_3) edge [bend left=95, looseness=1.3] node [below] {$b$} (q_0);+ −
\end{tikzpicture}+ −
\end{center}+ −
+ −
\noindent+ −
The accepting state $q_4$ is indicated with double circles. It is possible that a DFA has no+ −
accepting states at all, or that the starting state is also an accepting state.+ −
In the case above the transition function is defined everywhere and can be given as a table+ −
as follows:+ −
+ −
\[+ −
\begin{array}{lcl}+ −
(q_0, a) &\rightarrow& q_1\\+ −
(q_0, b) &\rightarrow& q_2\\+ −
(q_1, a) &\rightarrow& q_4\\+ −
(q_1, b) &\rightarrow& q_2\\+ −
(q_2, a) &\rightarrow& q_3\\+ −
(q_2, b) &\rightarrow& q_2\\+ −
(q_3, a) &\rightarrow& q_4\\+ −
(q_3, b) &\rightarrow& q_0\\+ −
(q_4, a) &\rightarrow& q_4\\+ −
(q_4, b) &\rightarrow& q_4\\+ −
\end{array}+ −
\]+ −
+ −
\noindent+ −
We need to define the notion of what language is accepted by an automaton. For this we + −
lift the transition function $\delta$ from characters to strings as follows:+ −
+ −
\[+ −
\begin{array}{lcl}+ −
\hat{\delta}(q, "") & \dn & q\\+ −
\hat{\delta}(q, c\!::\!s) & \dn & \hat{\delta}(\delta(q, c), s)\\+ −
\end{array}+ −
\]+ −
+ −
\noindent+ −
Given a string this means we start in the starting state and take the first character of the string,+ −
follow to the next sate, then take the second character and so on. Once the string is exhausted+ −
and we end up in an accepting state, then this string is accepted. Otherwise it is not accepted. + −
So $s$ in the \emph{language accepted by the automaton} $A(Q, q_0, F, \delta)$ iff+ −
+ −
\[+ −
\hat{\delta}(q_0, s) \in F + −
\]+ −
+ −
+ −
While with DFA it will always clear that given a character what the next state is, it will be useful to relax + −
this restriction. The resulting construction is called a \emph{non-deterministic finite automaton} (NFA) given+ −
as a four-tuple $A(Q, q_0, F, \rho)$ where+ −
+ −
\begin{itemize}+ −
\item $Q$ is a finite set of states+ −
\item $q_0$ is a start state+ −
\item $F$ are some accepting states with $F \subseteq Q$, and+ −
\item $\rho$ is a transition relation.+ −
\end{itemize}+ −
+ −
\noindent+ −
Two typical examples of NFAs are+ −
\begin{center}+ −
\begin{tabular}[t]{c@{\hspace{9mm}}c}+ −
\begin{tikzpicture}[scale=0.7,>=stealth',very thick,+ −
every state/.style={minimum size=0pt,draw=blue!50,very thick,fill=blue!20},]+ −
\node[state,initial] (q_0) {$q_0$};+ −
\node[state] (q_1) [above=of q_0] {$q_1$};+ −
\node[state, accepting] (q_2) [below=of q_0] {$q_2$};+ −
\path[->] (q_0) edge node [left] {$\epsilon$} (q_1);+ −
\path[->] (q_0) edge node [left] {$\epsilon$} (q_2);+ −
\path[->] (q_0) edge [loop right] node {$a$} ();+ −
\path[->] (q_1) edge [loop above] node {$a$} ();+ −
\path[->] (q_2) edge [loop below] node {$b$} ();+ −
\end{tikzpicture} &+ −
+ −
\raisebox{20mm}{+ −
\begin{tikzpicture}[scale=0.7,>=stealth',very thick,+ −
every state/.style={minimum size=0pt,draw=blue!50,very thick,fill=blue!20},]+ −
\node[state,initial] (r_1) {$r_1$};+ −
\node[state] (r_2) [above=of r_1] {$r_2$};+ −
\node[state, accepting] (r_3) [right=of r_1] {$r_3$};+ −
\path[->] (r_1) edge node [below] {$b$} (r_3);+ −
\path[->] (r_2) edge [bend left] node [above] {$a$} (r_3);+ −
\path[->] (r_1) edge [bend left] node [left] {$\epsilon$} (r_2);+ −
\path[->] (r_2) edge [bend left] node [right] {$a$} (r_1);+ −
\end{tikzpicture}}+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent+ −
There are a number of points you should note. Every DFA is a NFA, but not vice versa.+ −
The $\rho$ in NFAs is a transition \emph{relation} + −
(DFAs have a transition function). The difference between a function and a relation is that a function + −
has always a single output, while a relation gives, roughly speaking, several outputs. Look+ −
at the NFA on the right-hand side above: if you are currently in the state $r_2$ and you read a+ −
character $a$, then you can transition to $r_1$ \emph{or} $r_3$. Which route you take is not+ −
determined. This means if we need to decide whether a string is accepted by a NFA, we might have + −
to explore all possibilities. Also there is a special transition in NFAs which is called \emph{epsilon-transition}+ −
or \emph{silent transition}. This transition means you do not have to ``consume'' no part of+ −
the input string, but ``silently'' change to a different state.+ −
+ −
The reason for introducing NFAs is that there is a relatively simple (recursive) translation of regular expressions into+ −
NFAs. Consider the simple regular expressions $\varnothing$, $\epsilon$ and $c$. They can be translated+ −
as follows:+ −
+ −
\begin{center}+ −
\begin{tabular}[t]{l@{\hspace{10mm}}l}+ −
\raisebox{1mm}{$\varnothing$} & + −
\begin{tikzpicture}[scale=0.7,>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]+ −
\node[state, initial] (q_0) {$\mbox{}$};+ −
\end{tikzpicture}\\\\+ −
\raisebox{1mm}{$\epsilon$} & + −
\begin{tikzpicture}[scale=0.7,>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]+ −
\node[state, initial, accepting] (q_0) {$\mbox{}$};+ −
\end{tikzpicture}\\\\+ −
\raisebox{2mm}{$c$} & + −
\begin{tikzpicture}[scale=0.7,>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]+ −
\node[state, initial] (q_0) {$\mbox{}$};+ −
\node[state, accepting] (q_1) [right=of q_0] {$\mbox{}$};+ −
\path[->] (q_0) edge node [below] {$c$} (q_1);+ −
\end{tikzpicture}\\\\+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent+ −
The case for the sequence regular expression $r_1 \cdot r_2$ is as follows: We are given by recursion+ −
two automata representing $r_1$ and $r_2$ respectively. + −
+ −
\begin{center}+ −
\begin{tikzpicture}[node distance=3mm,+ −
>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]+ −
\node[state, initial] (q_0) {$\mbox{}$};+ −
\node (r_1) [right=of q_0] {$\ldots$};+ −
\node[state, accepting] (t_1) [right=of r_1] {$\mbox{}$};+ −
\node[state, accepting] (t_2) [above=of t_1] {$\mbox{}$};+ −
\node[state, accepting] (t_3) [below=of t_1] {$\mbox{}$};+ −
\node[state, initial] (a_0) [right=2.5cm of t_1] {$\mbox{}$};+ −
\node (b_1) [right=of a_0] {$\ldots$};+ −
\node[state, accepting] (c_1) [right=of b_1] {$\mbox{}$};+ −
\node[state, accepting] (c_2) [above=of c_1] {$\mbox{}$};+ −
\node[state, accepting] (c_3) [below=of c_1] {$\mbox{}$};+ −
\begin{pgfonlayer}{background}+ −
\node (1) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (q_0) (r_1) (t_1) (t_2) (t_3)] {};+ −
\node (2) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (a_0) (b_1) (c_1) (c_2) (c_3)] {};+ −
\node [yshift=2mm] at (1.north) {$r_1$};+ −
\node [yshift=2mm] at (2.north) {$r_2$};+ −
\end{pgfonlayer}+ −
\end{tikzpicture}+ −
\end{center}+ −
+ −
\noindent+ −
The first automaton has some accepting states. We obtain an automaton for $r_1\cdot r_2$ by connecting+ −
these accepting states with $\epsilon$-transitions to the starting state of the second automaton. By doing + −
so we make them non-accepting like so:+ −
+ −
\begin{center}+ −
\begin{tikzpicture}[node distance=3mm,+ −
>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]+ −
\node[state, initial] (q_0) {$\mbox{}$};+ −
\node (r_1) [right=of q_0] {$\ldots$};+ −
\node[state] (t_1) [right=of r_1] {$\mbox{}$};+ −
\node[state] (t_2) [above=of t_1] {$\mbox{}$};+ −
\node[state] (t_3) [below=of t_1] {$\mbox{}$};+ −
\node[state] (a_0) [right=2.5cm of t_1] {$\mbox{}$};+ −
\node (b_1) [right=of a_0] {$\ldots$};+ −
\node[state, accepting] (c_1) [right=of b_1] {$\mbox{}$};+ −
\node[state, accepting] (c_2) [above=of c_1] {$\mbox{}$};+ −
\node[state, accepting] (c_3) [below=of c_1] {$\mbox{}$};+ −
\path[->] (t_1) edge node [above, pos=0.3] {$\epsilon$} (a_0);+ −
\path[->] (t_2) edge node [above] {$\epsilon$} (a_0);+ −
\path[->] (t_3) edge node [below] {$\epsilon$} (a_0);+ −
+ −
\begin{pgfonlayer}{background}+ −
\node (3) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (q_0) (c_1) (c_2) (c_3)] {};+ −
\node [yshift=2mm] at (3.north) {$r_1\cdot r_2$};+ −
\end{pgfonlayer}+ −
\end{tikzpicture}+ −
\end{center}+ −
+ −
\noindent+ −
The case for the choice regular expression $r_1 + r_2$ is slightly different: We are given by recursion+ −
two automata representing $r_1$ and $r_2$ respectively. + −
+ −
\begin{center}+ −
\begin{tikzpicture}[node distance=3mm,+ −
>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]+ −
\node at (0,0) (1) {$\mbox{}$};+ −
\node[state, initial] (2) [above right=16mm of 1] {$\mbox{}$};+ −
\node[state, initial] (3) [below right=16mm of 1] {$\mbox{}$};+ −
+ −
\node (a) [right=of 2] {$\ldots$};+ −
\node[state, accepting] (a1) [right=of a] {$\mbox{}$};+ −
\node[state, accepting] (a2) [above=of a1] {$\mbox{}$};+ −
\node[state, accepting] (a3) [below=of a1] {$\mbox{}$};+ −
+ −
\node (b) [right=of 3] {$\ldots$};+ −
\node[state, accepting] (b1) [right=of b] {$\mbox{}$};+ −
\node[state, accepting] (b2) [above=of b1] {$\mbox{}$};+ −
\node[state, accepting] (b3) [below=of b1] {$\mbox{}$};+ −
\begin{pgfonlayer}{background}+ −
\node (1) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (2) (a1) (a2) (a3)] {};+ −
\node (2) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (3) (b1) (b2) (b3)] {};+ −
\node [yshift=3mm] at (1.north) {$r_1$};+ −
\node [yshift=3mm] at (2.north) {$r_2$};+ −
\end{pgfonlayer}+ −
\end{tikzpicture}+ −
\end{center}+ −
+ −
\noindent+ −
Each automaton has a single start state and potentially several accepting states. We obtain a+ −
NFA for the regular expression $r_1 + r_2$ by introducing a new starting state and connecting it+ −
with an $\epsilon$-transition to the two starting states above, like so+ −
+ −
\begin{center}+ −
\hspace{2cm}\begin{tikzpicture}[node distance=3mm,+ −
>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]+ −
\node at (0,0) [state, initial] (1) {$\mbox{}$};+ −
\node[state] (2) [above right=16mm of 1] {$\mbox{}$};+ −
\node[state] (3) [below right=16mm of 1] {$\mbox{}$};+ −
+ −
\node (a) [right=of 2] {$\ldots$};+ −
\node[state, accepting] (a1) [right=of a] {$\mbox{}$};+ −
\node[state, accepting] (a2) [above=of a1] {$\mbox{}$};+ −
\node[state, accepting] (a3) [below=of a1] {$\mbox{}$};+ −
+ −
\node (b) [right=of 3] {$\ldots$};+ −
\node[state, accepting] (b1) [right=of b] {$\mbox{}$};+ −
\node[state, accepting] (b2) [above=of b1] {$\mbox{}$};+ −
\node[state, accepting] (b3) [below=of b1] {$\mbox{}$};+ −
+ −
\path[->] (1) edge node [above] {$\epsilon$} (2);+ −
\path[->] (1) edge node [below] {$\epsilon$} (3);+ −
+ −
\begin{pgfonlayer}{background}+ −
\node (3) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (1) (a2) (a3) (b2) (b3)] {};+ −
\node [yshift=3mm] at (3.north) {$r_1+ r_2$};+ −
\end{pgfonlayer}+ −
\end{tikzpicture}+ −
\end{center}+ −
+ −
\noindent + −
Finally for the $*$-case we have an automaton for $r$+ −
+ −
\begin{center}+ −
\begin{tikzpicture}[node distance=3mm,+ −
>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]+ −
\node at (0,0) (1) {$\mbox{}$};+ −
\node[state, initial] (2) [right=16mm of 1] {$\mbox{}$};+ −
\node (a) [right=of 2] {$\ldots$};+ −
\node[state, accepting] (a1) [right=of a] {$\mbox{}$};+ −
\node[state, accepting] (a2) [above=of a1] {$\mbox{}$};+ −
\node[state, accepting] (a3) [below=of a1] {$\mbox{}$};+ −
\begin{pgfonlayer}{background}+ −
\node (1) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (2) (a1) (a2) (a3)] {};+ −
\node [yshift=3mm] at (1.north) {$r$};+ −
\end{pgfonlayer}+ −
\end{tikzpicture}+ −
\end{center}+ −
+ −
\noindent+ −
and connect its accepting states to a new starting state via $\epsilon$-transitions. This new+ −
starting state is also an accepting state, because $r^*$ can also recognise the empty string.+ −
This gives the following automaton for $r^*$:+ −
+ −
\begin{center}+ −
\begin{tikzpicture}[node distance=3mm,+ −
>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]+ −
\node at (0,0) [state, initial,accepting] (1) {$\mbox{}$};+ −
\node[state] (2) [right=16mm of 1] {$\mbox{}$};+ −
\node (a) [right=of 2] {$\ldots$};+ −
\node[state] (a1) [right=of a] {$\mbox{}$};+ −
\node[state] (a2) [above=of a1] {$\mbox{}$};+ −
\node[state] (a3) [below=of a1] {$\mbox{}$};+ −
\path[->] (1) edge node [above] {$\epsilon$} (2);+ −
\path[->] (a1) edge [bend left=45] node [above] {$\epsilon$} (1);+ −
\path[->] (a2) edge [bend right] node [below] {$\epsilon$} (1);+ −
\path[->] (a3) edge [bend left=45] node [below] {$\epsilon$} (1);+ −
\begin{pgfonlayer}{background}+ −
\node (2) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (1) (a2) (a3)] {};+ −
\node [yshift=3mm] at (2.north) {$r^*$};+ −
\end{pgfonlayer}+ −
\end{tikzpicture}+ −
\end{center}+ −
+ −
\noindent+ −
This construction of a NFA from a regular expression was invented by Ken Thompson in 1968.+ −
+ −
\end{document}+ −
+ −
%%% Local Variables: + −
%%% mode: latex+ −
%%% TeX-master: t+ −
%%% End: + −