Binary file handouts/ho02.pdf has changed
--- a/handouts/ho02.tex Fri Apr 28 11:01:25 2017 +0100
+++ b/handouts/ho02.tex Sun May 07 00:20:58 2017 +0100
@@ -52,16 +52,17 @@
\begin{tikzpicture}
\begin{axis}[
xlabel={$n$},
- x label style={at={(1.05,0.0)}},
+ x label style={at={(1.1,0.0)}},
+ %%xtick={0,1000000,...,5000000},
ylabel={time in secs},
enlargelimits=false,
ymax=35,
ytick={0,5,...,30},
axis lines=left,
- scaled ticks=false,
+ %scaled ticks=false,
width=6.5cm,
height=5cm,
- legend entries={Scala V3},
+ legend entries={Our matcher},
legend pos=north east,
legend cell align=left]
%\addplot[green,mark=square*,mark options={fill=white}] table {re2a.data};
@@ -69,7 +70,7 @@
\end{axis}
\end{tikzpicture}
\end{tabular}
-\end{center}
+\end{center}\bigskip
\begin{center}
Graphs: $a^{?\{n\}} \cdot a^{\{n\}}$ and strings $\underbrace{a\ldots a}_{n}$\\
@@ -110,7 +111,7 @@
axis lines=left,
width=6.5cm,
height=5cm,
- legend entries={Scala V3},
+ legend entries={Our matcher},
legend pos=north east,
legend cell align=left]
%\addplot[green,mark=square*,mark options={fill=white}] table {re2.data};
@@ -119,14 +120,14 @@
\end{tikzpicture}
\end{tabular}
\end{center}
-\medskip
+\bigskip
\noindent
-We will use these regular expressions and strings as running
-examples. There will be several versions (V1, V2, V3,\ldots) of the
-algorithm.\footnote{The corresponding files are \texttt{re1.scala},
- \texttt{re2.scala} and so on. As usual, you can find the code on
- KEATS.}\bigskip
+In what follows we will use these regular expressions and strings as
+running examples. There will be several versions (V1, V2, V3,\ldots)
+of our matcher.\footnote{The corresponding files are
+ \texttt{re1.scala}, \texttt{re2.scala} and so on. As usual, you can
+ find the code on KEATS.}\bigskip
\noindent
Having specified in the previous lecture what
@@ -138,7 +139,7 @@
s \in L(r)
\]
-\noindent we can look at an algorithm to solve this problem. Clearly
+\noindent we can look for an algorithm to solve this problem. Clearly
we cannot use the function $L$ directly for this, because in general
the set of strings $L$ returns is infinite (recall what $L(a^*)$ is).
In such cases there is no way we can implement an exhaustive test for
@@ -237,13 +238,14 @@
\end{equation}
\noindent If we can find an equivalent regular expression that is
-simpler (smaller for example), then this might potentially make our
-matching algorithm run faster. We can look for such a simpler regular
-expression $r'$ because whether a string $s$ is in $L(r)$ or in
-$L(r')$ with $r\equiv r'$ will always give the same answer. In the
-example above you will see that the regular expression is equivalent
-to just $r_1$. You can verify this by iteratively applying the
-simplification rules from above:
+simpler (that usually means smaller), then this might potentially make
+our matching algorithm run faster. We can look for such a simpler
+regular expression $r'$ because whether a string $s$ is in $L(r)$ or
+in $L(r')$ with $r\equiv r'$ will always give the same answer. Yes?
+
+In the example above you will see that the regular expression is
+equivalent to just $r_1$. You can verify this by iteratively applying
+the simplification rules from above:
\begin{center}
\begin{tabular}{ll}
@@ -264,7 +266,7 @@
$\ZERO$s, therefore simplifying them away will make the
algorithm quite a bit faster.
-Finally here are three equivalences between regulare expressions which are
+Finally here are three equivalences between regular expressions which are
not so obvious:
\begin{center}
@@ -276,9 +278,11 @@
\end{center}
\noindent
-You can try to establish them. As an aside, there has been a lot of research
-in questions like: Can one always decide when two regular expressions are
-equivalent or not? What does an algorithm look like to decide this?
+We will not use them in our algorithm, but feel free to make sure they
+hold. As an aside, there has been a lot of research about questions
+like: Can one always decide when two regular expressions are
+equivalent or not? What does an algorithm look like to decide this
+efficiently?
\subsection*{The Matching Algorithm}
@@ -315,10 +319,10 @@
\emph{derivative} of a regular expression. This is a function
which will take a regular expression, say $r$, and a
character, say $c$, as arguments and returns a new regular
-expression. Be careful that the intuition behind this function
+expression. Be mindful that the intuition behind this function
is not so easy to grasp on first reading. Essentially this
function solves the following problem: if $r$ can match a
-string of the form $c\!::\!s$, what does the regular
+string of the form $c\!::\!s$, what does a regular
expression look like that can match just $s$? The definition
of this function is as follows:
@@ -372,9 +376,9 @@
and ``append'' $r^*$ in order to match the rest of $s$. Still
makes sense?
-If all this did not make sense yet, here is another way to rationalise
-the definition of $\textit{der}$ by considering the following operation
-on sets:
+If all this did not make sense yet, here is another way to explain the
+definition of $\textit{der}$ by considering the following operation on
+sets:
\begin{equation}\label{Der}
\textit{Der}\,c\,A\;\dn\;\{s\,|\,c\!::\!s \in A\}
@@ -454,7 +458,7 @@
\end{center}
\noindent This function iterates $\textit{der}$ taking one character at
-the time from the original string until it is exhausted.
+the time from the original string until the string is exhausted.
Having $\textit{der}s$ in place, we can finally define our matching
algorithm:
@@ -493,11 +497,13 @@
\lstinputlisting[numbers=left,linebackgroundcolor=
{\ifodd\value{lstnumber}\color{capri!3}\fi}]
{../progs/app5.scala}
-\caption{Scala implementation of the \textit{nullable} and
+\caption{A Scala implementation of the \textit{nullable} and
derivative functions. These functions are easy to
- implement in functional languages, because their built-in pattern
+ implement in functional languages. This is because pattern
matching and recursion allow us to mimic the mathematical
- definitions very closely.\label{scala1}}
+ definitions very closely. Nearly all functional
+ programming languages support pattern matching and
+ recursion out of the box.\label{scala1}}
\end{figure}
@@ -538,8 +544,8 @@
For running the algorithm with our first example, the evil
regular expression $a^?{}^{\{n\}}a^{\{n\}}$, we need to implement
-the optional regular expression and the exactly $n$-times
-regular expression. This can be done with the translations
+the optional regular expression and the `exactly $n$-times
+regular expression'. This can be done with the translations
\lstinputlisting[numbers=none]{../progs/app51.scala}
@@ -572,8 +578,8 @@
\end{tikzpicture}
\end{center}
-\noindent Analysing this failure we notice that for
-$a^{\{n\}}$ we generate quite big regular expressions:
+\noindent Analysing this failure we notice that for $a^{\{n\}}$, for
+example, we generate quite big regular expressions:
\begin{center}
\begin{tabular}{rl}
@@ -591,7 +597,7 @@
large regular expressions will cause problems. This problem
is aggravated by $a^?$ being represented as $a + \ONE$.
-We can however fix this by having an explicit constructor for
+We can however fix this easily by having an explicit constructor for
$r^{\{n\}}$. In Scala we would introduce a constructor like
\begin{center}
@@ -634,8 +640,8 @@
25 seconds handle regular expressions up to $n = 1,100$ before a
StackOverflow is raised. Recall that Python and Ruby (and our first
version, Scala V1) could only handle $n = 27$ or so in 30
-seconds. There is no change for our second example $(a^*)^* \cdot
-b$---so this is still good.
+seconds. We have not tried our algorithm on the second example $(a^*)^* \cdot
+b$---but it is doing OK with it.
The moral is that our algorithm is rather sensitive to the
@@ -656,9 +662,9 @@
\end{center}
\noindent
-If we simplify them according to the simple rules from the
-beginning, we can replace the right-hand sides by the
-smaller equivalent regular expressions
+If we simplify them according to the simplification rules from the
+beginning, we can replace the right-hand sides by the smaller
+equivalent regular expressions
\begin{center}
\begin{tabular}{l}
@@ -726,7 +732,7 @@
up to 11,000 \texttt{a}s. Similarly, Java and Python needed 30
seconds to find out the regular expression $(a^*)^* \cdot b$ does not
match the string of 28 \texttt{a}s. We can do the same in
-for strings of 6,000,000 \texttt{a}s:
+for strings composed of nearly 6,000,000 \texttt{a}s:
\begin{center}
@@ -739,7 +745,7 @@
ymax=35,
ytick={0,5,...,30},
axis lines=left,
- scaled ticks=false,
+ %scaled ticks=false,
x label style={at={(1.09,0.0)}},
%xmax=7700000,
width=9cm,
@@ -755,12 +761,11 @@
\subsection*{Epilogue}
-(23/Aug/2016) I recently found another place where this algorithm can be
-sped up (this idea is not integrated with what is coming next,
-but I present it nonetheless). The idea is to define \texttt{ders}
-not such that it iterates the derivative character-by-character, but
-in bigger chunks. The resulting code for \texttt{ders2} looks as
-follows:
+(23/Aug/2016) I recently found another place where this algorithm can
+be sped up (this idea is not integrated with what is coming next, but
+I present it nonetheless). The idea is to not define \texttt{ders}
+that it iterates the derivative character-by-character, but in bigger
+chunks. The resulting code for \texttt{ders2} looks as follows:
\lstinputlisting[numbers=none]{../progs/app52.scala}
@@ -790,7 +795,7 @@
ymax=33,
%scaled ticks=false,
axis lines=left,
- width=5.5cm,
+ width=5.3cm,
height=5cm,
legend entries={Scala V3, Scala V4},
legend style={at={(0.1,-0.2)},anchor=north}]
@@ -806,12 +811,12 @@
x label style={at={(1.09,0.0)}},
ylabel={time in secs},
enlargelimits=false,
- xmax=8100000,
+ xmax=8200000,
ytick={0,5,...,30},
ymax=33,
%scaled ticks=false,
axis lines=left,
- width=5.5cm,
+ width=5.3cm,
height=5cm,
legend entries={Scala V3, Scala V4},
legend style={at={(0.1,-0.2)},anchor=north}]
@@ -827,7 +832,7 @@
You might not like doing proofs. But they serve a very
important purpose in Computer Science: How can we be sure that
-our algorithm matches its specification. We can try to test
+our algorithm matches its specification? We can try to test
the algorithm, but that often overlooks corner cases and an
exhaustive testing is impossible (since there are infinitely
many inputs). Proofs allow us to ensure that an algorithm
@@ -848,7 +853,7 @@
\end{tabular}
\end{center}
-\noindent If you want to show a property $P(r)$ for all
+\noindent If you want to show a property $P(r)$ for \emph{all}
regular expressions $r$, then you have to follow essentially
the recipe:
@@ -900,8 +905,9 @@
\label{propalt}
\end{equation}
-\noindent The difference to the base cases is that in this
-case we can already assume we proved
+\noindent The difference to the base cases is that in the inductive
+cases we can already assume we proved $P$ for the components, that is
+we can assume.
\begin{center}
\begin{tabular}{l}
@@ -910,9 +916,9 @@
\end{tabular}
\end{center}
-\noindent These are the induction hypotheses. To check this
+\noindent These are called the induction hypotheses. To check this
case, we can start from $\textit{nullable}(r_1 + r_2)$, which by
-definition is
+definition of $\textit{nullable}$ is
\[
\textit{nullable}(r_1) \vee \textit{nullable}(r_2)
@@ -935,18 +941,18 @@
[] \in L(r_1)\cup L(r_2)
\]
-\noindent but this is by definition of $L$ exactly $[] \in
-L(r_1 + r_2)$, which we needed to establish according to
+\noindent but this is by definition of $L$ exactly $[] \in L(r_1 +
+r_2)$, which we needed to establish according to statement in
\eqref{propalt}. What we have shown is that starting from
$\textit{nullable}(r_1 + r_2)$ we have done equivalent transformations
-to end up with $[] \in L(r_1 + r_2)$. Consequently we have
-established that $P(r_1 + r_2)$ holds.
+to end up with $[] \in L(r_1 + r_2)$. Consequently we have established
+that $P(r_1 + r_2)$ holds.
In order to complete the proof we would now need to look
at the cases \mbox{$P(r_1\cdot r_2)$} and $P(r^*)$. Again I let you
check the details.
-You might have to do induction proofs over strings.
+You might also have to do induction proofs over strings.
That means you want to establish a property $P(s)$ for all
strings $s$. For this remember strings are lists of
characters. These lists can be either the empty list or a
@@ -982,8 +988,9 @@
L(\textit{der}\,c\,r) = \textit{Der}\,c\,(L(r))
\]
-\noindent holds (this would be of course a property that
-needs to be proved in a side-lemma by induction on $r$).
+\noindent holds (this would be of course another property that needs
+to be proved in a side-lemma by induction on $r$). This is a bit
+more challenging, but not impossible.
To sum up, using reasoning like the one shown above allows us
to show the correctness of our algorithm. To see this,
@@ -1002,9 +1009,9 @@
\label{dersstep}
\end{equation}
-\noindent But we have shown above in \eqref{dersprop}, that
-the $\textit{Ders}$ can be replaced by $L(\textit{ders}\ldots)$. That means
-\eqref{dersstep} is equivalent to
+\noindent You agree? But we have shown above in \eqref{dersprop},
+that the $\textit{Ders}$ can be replaced by
+$L(\textit{ders}\ldots)$. That means \eqref{dersstep} is equivalent to
\begin{equation}
[] \in L(\textit{ders}\,s\,r)
@@ -1033,11 +1040,13 @@
s\in L(r)
\]
-\noindent which is the property we set out to prove:
-our algorithm meets its specification. To have done
-so, requires a few induction proofs about strings and
-regular expressions. Following the recipes is already a big
-step in performing these proofs.
+\noindent which is the property we set out to prove: our algorithm
+meets its specification. To have done so, requires a few induction
+proofs about strings and regular expressions. Following the \emph{induction
+recipes} is already a big step in actually performing these proofs.
+If you do not believe it, proofs have helped me to make sure my code
+is correct and in several instances prevented me of letting slip
+embarassing mistakes into the `wild'.
\end{document}
Binary file handouts/ho03.pdf has changed
--- a/handouts/ho03.tex Fri Apr 28 11:01:25 2017 +0100
+++ b/handouts/ho03.tex Sun May 07 00:20:58 2017 +0100
@@ -9,17 +9,19 @@
\section*{Handout 3 (Finite Automata)}
+
Every formal language and compiler course I know of bombards you first
with automata and then to a much, much smaller extend with regular
expressions. As you can see, this course is turned upside down:
regular expressions come first. The reason is that regular expressions
are easier to reason about and the notion of derivatives, although
already quite old, only became more widely known rather
-recently. Still let us in this lecture have a closer look at automata
+recently. Still, let us in this lecture have a closer look at automata
and their relation to regular expressions. This will help us with
understanding why the regular expression matchers in Python, Ruby and
Java are so slow with certain regular expressions. On the way we will
-also see what are the limitations of regular expressions.
+also see what are the limitations of regular
+expressions. Unfortunately, they cannot be used for \emph{everything}.
\subsection*{Deterministic Finite Automata}
@@ -38,11 +40,12 @@
\item $\delta$ is the transition function.
\end{itemize}
-\noindent The transition function determines how to ``transition''
-from one state to the next state with respect to a character. We have
-the assumption that these transition functions do not need to be
-defined everywhere: so it can be the case that given a character there
-is no next state, in which case we need to raise a kind of ``failure
+\noindent I am sure you have seen this defininition already
+before. The transition function determines how to ``transition'' from
+one state to the next state with respect to a character. We have the
+assumption that these transition functions do not need to be defined
+everywhere: so it can be the case that given a character there is no
+next state, in which case we need to raise a kind of ``failure
exception''. That means actually we have \emph{partial} functions as
transitions---see the Scala implementation of DFAs later on. A
typical example of a DFA is
@@ -118,8 +121,8 @@
\widehat{\delta}(Q_0, s) \in F
\]
-\noindent I let you think about a definition that describes
-the set of all strings accepted by an automaton.
+\noindent I let you think about a definition that describes the set of
+all strings accepted by a determinsitic finite automaton.
\begin{figure}[p]
\small
@@ -127,7 +130,7 @@
{\ifodd\value{lstnumber}\color{capri!3}\fi}]
{../progs/display/dfa.scala}
\caption{A Scala implementation of DFAs using partial functions.
- Notice some subtleties: \texttt{deltas} implements the delta-hat
+ Note some subtleties: \texttt{deltas} implements the delta-hat
construction by lifting the (partial) transition function to lists
of characters. Since \texttt{delta} is given as a partial function,
it can obviously go ``wrong'' in which case the \texttt{Try} in
@@ -239,9 +242,9 @@
\emph{or} to state $Q_2$ when receiving an $a$. Similarly in state
$Q_1$ and receiving an $a$, we can stay in state $Q_1$ \emph{or} go to
$Q_2$. This kind of choice is not allowed with DFAs. The downside of
-this choice is that when it comes to deciding whether a string is
+this choice in NFAs is that when it comes to deciding whether a string is
accepted by a NFA we potentially have to explore all possibilities. I
-let you think which kind of strings the above NFA accepts.
+let you think which strings the above NFA accepts.
There are a number of additional points you should note about
@@ -259,10 +262,10 @@
Perhaps interestingly, I do not actually use relations for my NFAs,
but use transition functions that return sets of states. DFAs have
partial transition functions that return a single state; my NFAs
-return a set. I let you think about this representation for
+return a set of states. I let you think about this representation for
NFA-transitions and how it corresponds to the relations used in the
mathematical definition of NFAs. An example of a transition function
-in Scala for the NFA above is
+in Scala for the NFA shown above is
{\small\begin{lstlisting}[language=Scala,linebackgroundcolor=
{\ifodd\value{lstnumber}\color{capri!3}\fi}]
@@ -338,7 +341,7 @@
well-known method for this is called \emph{Thompson Construction},
after the Turing Award winner Ken Thompson. This method is by
recursion over regular expressions and depends on the non-determinism
-in NFAs described in the earlier section. You will see shortly why
+in NFAs described in the previous section. You will see shortly why
this construction works well with NFAs, but is not so straightforward
with DFAs.
@@ -404,15 +407,15 @@
unfortunately. If we were to follow the many textbooks on the
subject, we would now start with defining what $\epsilon$NFAs
are---that would require extending the transition relation of
-NFAs. Next, show that the $\epsilon$NFAs are equivalent to NFAs and so
-on. Once we have done all this on paper, we would need to implement
-$\epsilon$NFAs. Lets try to take a shortcut instead. We are not really
-interested in $\epsilon$NFAs; they are only a convenient tool for
-translating regular expressions into automata. So we are not going to
-implementing them explicitly, but translate them immediately into NFAs
-(in a sense $\epsilon$NFAs are just a convenient API for lazy people ;o).
-How does the translation work? Well we have to find all transitions of
-the form
+NFAs. Next, we woudl show that the $\epsilon$NFAs are equivalent to
+NFAs and so on. Once we have done all this on paper, we would need to
+implement $\epsilon$NFAs. Lets try to take a shortcut instead. We are
+not really interested in $\epsilon$NFAs; they are only a convenient
+tool for translating regular expressions into automata. So we are not
+going to implementing them explicitly, but translate them immediately
+into NFAs (in a sense $\epsilon$NFAs are just a convenient API for
+lazy people ;o). How does the translation work? Well we have to find
+all transitions of the form
\[
q\stackrel{\epsilon}{\longrightarrow}\ldots\stackrel{\epsilon}{\longrightarrow}
@@ -444,10 +447,10 @@
and verify that I did not forget any transition.
So in what follows, whenever we are given an $\epsilon$NFA we will
-replace it by an equivalent NFA. The code for this is given in
-Figure~\ref{enfa}. The main workhorse in this code is a function that
-calculates a fixpoint of function (Lines 5--10). This function is used
-for ``discovering'' which states are reachable by
+replace it by an equivalent NFA. The Scala code for this translation
+is given in Figure~\ref{enfa}. The main workhorse in this code is a
+function that calculates a fixpoint of function (Lines 5--10). This
+function is used for ``discovering'' which states are reachable by
$\epsilon$-transitions. Once no new state is discovered, a fixpoint is
reached. This is used for example when calculating the starting states
of an equivalent NFA (see Line 36): we start with all starting states
@@ -466,11 +469,12 @@
{../progs/display/enfa.scala}
\caption{A Scala function that translates $\epsilon$NFA into NFAs. The
- transtions of $\epsilon$NFA take as input an \texttt{Option[C]}.
+ transtion function of $\epsilon$NFA takes as input an \texttt{Option[C]}.
\texttt{None} stands for an $\epsilon$-transition; \texttt{Some(c)}
- for a ``proper'' transition. The functions in Lines 18--26 calculate
+ for a ``proper'' transition consuming a character. The functions in
+ Lines 18--26 calculate
all states reachable by one or more $\epsilon$-transition for a given
- set of states. The NFA is constructed in in Lines 36--38.\label{enfa}}
+ set of states. The NFA is constructed in Lines 36--38.\label{enfa}}
\end{figure}
Also look carefully how the transitions of $\epsilon$NFAs are
@@ -508,7 +512,7 @@
$\ZERO$, $\ONE$ and $c$. They can be translated into equivalent NFAs
as follows:
-\begin{center}
+\begin{equation}\mbox{
\begin{tabular}[t]{l@{\hspace{10mm}}l}
\raisebox{1mm}{$\ZERO$} &
\begin{tikzpicture}[scale=0.7,>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
@@ -524,18 +528,18 @@
\node[state, accepting] (Q_1) [right=of Q_0] {$\mbox{}$};
\path[->] (Q_0) edge node [below] {$c$} (Q_1);
\end{tikzpicture}\\
-\end{tabular}
-\end{center}
+\end{tabular}}\label{simplecases}
+\end{equation}
\noindent
I let you think whether the NFAs can match exactly those strings the
regular expressions can match. To do this translation in code we need
a way to construct states programatically...and as an additional
-constrain Scala needs to recognise these states as being distinct.
+constrain Scala needs to recognise that these states are being distinct.
For this I implemented in Figure~\ref{thompson1} a class
\texttt{TState} that includes a counter and a companion object that
-increases this counter whenever a state is created.\footnote{You might
- have to read up what \emph{companion objects} are in Scala.}
+increases this counter whenever a new state is created.\footnote{You might
+ have to read up what \emph{companion objects} do in Scala.}
\begin{figure}[p]
\small
@@ -543,11 +547,12 @@
{\ifodd\value{lstnumber}\color{capri!3}\fi}]
{../progs/display/thompson1.scala}
\caption{The first part of the Thompson Construction. Lines 7--16
- implement a way how to create states that are all
+ implement a way of how to create new states that are all
distinct by virtue of a counter. This counter is
increased in the companion object of \texttt{TState}
whenever a new state is created. The code in Lines 24--40
- constructs NFAs for the simple regular expressions.
+ constructs NFAs for the simple regular expressions $\ZERO$, $\ONE$ and $c$.
+ Compare the pictures given in \eqref{simplecases}.
\label{thompson1}}
\end{figure}
@@ -564,57 +569,77 @@
(both given as partial functions).\label{thompson2}}
\end{figure}
-The case for the sequence regular expression $r_1 \cdot r_2$ is as
-follows: We are given by recursion two automata representing $r_1$ and
-$r_2$ respectively.
+The case for the sequence regular expression $r_1 \cdot r_2$ is a bit more
+complicated: We are given by recursion two NFAs representing the regular
+expressions $r_1$ and $r_2$ respectively.
\begin{center}
\begin{tikzpicture}[node distance=3mm,
- >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
+ >=stealth',very thick,
+ every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
\node[state, initial] (Q_0) {$\mbox{}$};
-\node (r_1) [right=of Q_0] {$\ldots$};
-\node[state, accepting] (t_1) [right=of r_1] {$\mbox{}$};
-\node[state, accepting] (t_2) [above=of t_1] {$\mbox{}$};
-\node[state, accepting] (t_3) [below=of t_1] {$\mbox{}$};
-\node[state, initial] (a_0) [right=2.5cm of t_1] {$\mbox{}$};
-\node (b_1) [right=of a_0] {$\ldots$};
+\node[state, initial] (Q_01) [below=1mm of Q_0] {$\mbox{}$};
+\node[state, initial] (Q_02) [above=1mm of Q_0] {$\mbox{}$};
+\node (R_1) [right=of Q_0] {$\ldots$};
+\node[state, accepting] (T_1) [right=of R_1] {$\mbox{}$};
+\node[state, accepting] (T_2) [above=of T_1] {$\mbox{}$};
+\node[state, accepting] (T_3) [below=of T_1] {$\mbox{}$};
+
+\node (A_0) [right=2.5cm of T_1] {$\mbox{}$};
+\node[state, initial] (A_01) [above=1mm of A_0] {$\mbox{}$};
+\node[state, initial] (A_02) [below=1mm of A_0] {$\mbox{}$};
+
+\node (b_1) [right=of A_0] {$\ldots$};
\node[state, accepting] (c_1) [right=of b_1] {$\mbox{}$};
\node[state, accepting] (c_2) [above=of c_1] {$\mbox{}$};
\node[state, accepting] (c_3) [below=of c_1] {$\mbox{}$};
\begin{pgfonlayer}{background}
-\node (1) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (Q_0) (r_1) (t_1) (t_2) (t_3)] {};
-\node (2) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (a_0) (b_1) (c_1) (c_2) (c_3)] {};
+ \node (1) [rounded corners, inner sep=1mm, thick,
+ draw=black!60, fill=black!20, fit= (Q_0) (R_1) (T_1) (T_2) (T_3)] {};
+ \node (2) [rounded corners, inner sep=1mm, thick,
+ draw=black!60, fill=black!20, fit= (A_0) (b_1) (c_1) (c_2) (c_3)] {};
\node [yshift=2mm] at (1.north) {$r_1$};
\node [yshift=2mm] at (2.north) {$r_2$};
\end{pgfonlayer}
\end{tikzpicture}
\end{center}
-\noindent The first automaton has some accepting states. We
-obtain an automaton for $r_1\cdot r_2$ by connecting these
-accepting states with $\epsilon$-transitions to the starting
-state of the second automaton. By doing so we make them
+\noindent The first NFA has some accepting states and the second some
+starting states. We obtain $\epsilon$NFA for $r_1\cdot r_2$ by
+connecting these accepting states with $\epsilon$-transitions to the
+starting states of the second automaton. By doing so we make them
non-accepting like so:
\begin{center}
\begin{tikzpicture}[node distance=3mm,
- >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
+ >=stealth',very thick,
+ every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
\node[state, initial] (Q_0) {$\mbox{}$};
+\node[state, initial] (Q_01) [below=1mm of Q_0] {$\mbox{}$};
+\node[state, initial] (Q_02) [above=1mm of Q_0] {$\mbox{}$};
\node (r_1) [right=of Q_0] {$\ldots$};
\node[state] (t_1) [right=of r_1] {$\mbox{}$};
\node[state] (t_2) [above=of t_1] {$\mbox{}$};
\node[state] (t_3) [below=of t_1] {$\mbox{}$};
-\node[state] (a_0) [right=2.5cm of t_1] {$\mbox{}$};
-\node (b_1) [right=of a_0] {$\ldots$};
+
+\node (A_0) [right=2.5cm of t_1] {$\mbox{}$};
+\node[state] (A_01) [above=1mm of A_0] {$\mbox{}$};
+\node[state] (A_02) [below=1mm of A_0] {$\mbox{}$};
+
+\node (b_1) [right=of A_0] {$\ldots$};
\node[state, accepting] (c_1) [right=of b_1] {$\mbox{}$};
\node[state, accepting] (c_2) [above=of c_1] {$\mbox{}$};
\node[state, accepting] (c_3) [below=of c_1] {$\mbox{}$};
-\path[->] (t_1) edge node [above, pos=0.3] {$\epsilon$} (a_0);
-\path[->] (t_2) edge node [above] {$\epsilon$} (a_0);
-\path[->] (t_3) edge node [below] {$\epsilon$} (a_0);
-
+\path[->] (t_1) edge (A_01);
+\path[->] (t_2) edge node [above] {$\epsilon$} (A_01);
+\path[->] (t_3) edge (A_01);
+\path[->] (t_1) edge (A_02);
+\path[->] (t_2) edge (A_02);
+\path[->] (t_3) edge node [below] {$\epsilon$} (A_02);
+
\begin{pgfonlayer}{background}
-\node (3) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (Q_0) (c_1) (c_2) (c_3)] {};
+ \node (3) [rounded corners, inner sep=1mm, thick,
+ draw=black!60, fill=black!20, fit= (Q_0) (c_1) (c_2) (c_3)] {};
\node [yshift=2mm] at (3.north) {$r_1\cdot r_2$};
\end{pgfonlayer}
\end{tikzpicture}
@@ -626,7 +651,8 @@
\begin{center}
\begin{tikzpicture}[node distance=3mm,
- >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
+ >=stealth',very thick,
+ every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
\node at (0,0) (1) {$\mbox{}$};
\node[state, initial] (2) [above right=16mm of 1] {$\mbox{}$};
\node[state, initial] (3) [below right=16mm of 1] {$\mbox{}$};
@@ -706,7 +732,7 @@
also an accepting state, because $r^*$ can recognise the
empty string. This gives the following automaton for $r^*$:
-\begin{center}
+\begin{center}
\begin{tikzpicture}[node distance=3mm,
>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
\node at (0,0) [state, initial,accepting] (1) {$\mbox{}$};
@@ -726,16 +752,116 @@
\end{tikzpicture}
\end{center}
+{\small\begin{lstlisting}[language=Scala,linebackgroundcolor=
+ {\ifodd\value{lstnumber}\color{capri!3}\fi}]
+def thompson (r: Rexp) : NFAt = r match {
+ case ZERO => NFA_ZERO()
+ case ONE => NFA_ONE()
+ case CHAR(c) => NFA_CHAR(c)
+ case ALT(r1, r2) => NFA_ALT(thompson(r1), thompson(r2))
+ case SEQ(r1, r2) => NFA_SEQ(thompson(r1), thompson(r2))
+ case STAR(r1) => NFA_STAR(thompson(r1))
+}
+\end{lstlisting}}
+
+
+\begin{center}
+\begin{tabular}{@{\hspace{-1mm}}c@{\hspace{1mm}}c@{}}
+\begin{tikzpicture}
+\begin{axis}[
+ title={Graph: $\texttt{(a*)*\,b}$ and strings
+ $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
+ xlabel={$n$},
+ x label style={at={(1.05,0.0)}},
+ ylabel={time in secs},
+ enlargelimits=false,
+ xtick={0,5,...,30},
+ xmax=33,
+ ymax=35,
+ ytick={0,5,...,30},
+ scaled ticks=false,
+ axis lines=left,
+ width=5.5cm,
+ height=4.5cm,
+ legend entries={Python, Java},
+ legend pos=outer north east,
+ legend cell align=left]
+ \addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
+ \addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
+ % depth-first search in NFAs
+ \addplot[red,mark=*, mark options={fill=white}] table {
+ 1 0.00605
+ 2 0.03086
+ 3 0.11994
+ 4 0.45389
+ 5 2.06192
+ 6 8.04894
+ 7 32.63549
+ };
+\end{axis}
+\end{tikzpicture}
+&
+\begin{tikzpicture}
+\begin{axis}[
+ title={Graph: $\texttt{a?\{n\}\,a{\{n\}}}$ and strings
+ $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
+ xlabel={$n$},
+ x label style={at={(1.05,0.0)}},
+ ylabel={time in secs},
+ enlargelimits=false,
+ xtick={0,5,...,30},
+ xmax=33,
+ ymax=35,
+ ytick={0,5,...,30},
+ scaled ticks=false,
+ axis lines=left,
+ width=5.5cm,
+ height=4.5cm,
+ legend entries={Python,Ruby},
+ legend pos=south east,
+ legend cell align=left]
+ \addplot[blue,mark=*, mark options={fill=white}] table {re-python.data};
+ \addplot[brown,mark=triangle*, mark options={fill=white}] table {re-ruby.data};
+ % breath-first search in NFAs
+ \addplot[red,mark=*, mark options={fill=white}] table {
+ 1 0.00741
+ 2 0.02657
+ 3 0.08317
+ 4 0.22133
+ 5 0.54463
+ 6 1.42062
+ 7 4.04926
+ 8 12.70395
+ 9 39.21555
+ 10 112.05466
+ };
+\end{axis}
+\end{tikzpicture}
+\end{tabular}
+\end{center}
+
+
\noindent This construction of a NFA from a regular expression
was invented by Ken Thompson in 1968.
\subsubsection*{Subset Construction}
-What is interesting is that for every NFA we can find a DFA
-which recognises the same language. This can, for example, be
-done by the \emph{subset construction}. Consider again the NFA
-below on the left, consisting of nodes labelled $0$, $1$ and $2$.
+Remember that we did not bother with defining and implementing
+$\epsilon$NFA; we immediately translated them into equivalent
+NFAs. Equivalent in the sense of accepting the same language (though
+we only claimed this and did not prove it rigorously). Remember also
+that NFAs have a non-deterministic transitions, given as a relation.
+This non-determinism makes it harder to decide when a string is
+accepted or not; such a decision is rather straightforward with DFAs
+(remember their transition function).
+
+What is interesting is that for every NFA we can find a DFA that also
+recognises the same language. This might sound like a bit paradoxical,
+but I litke to show you this next. There are a number of ways of
+transforming a NFA into an equivalent DFA, but the most famous is
+\emph{subset construction}. Consider again the NFA below on the left,
+consisting of nodes labelled, say, with $0$, $1$ and $2$.
\begin{center}
\begin{tabular}{c@{\hspace{10mm}}c}
--- a/handouts/scala-ho.tex Fri Apr 28 11:01:25 2017 +0100
+++ b/handouts/scala-ho.tex Sun May 07 00:20:58 2017 +0100
@@ -982,6 +982,12 @@
inherited from the JVM that can be really annoying. For
example a fixed stack size. One can work around this
particular limitation, but why does one have to?
+More such `puzzles' can be found at
+
+\begin{center}
+ \url{http://scalapuzzlers.com} and
+ \url{http://latkin.org/blog/2017/05/02/when-the-scala-compiler-doesnt-help/}
+\end{center}
Even if Scala has been a success in several high-profile
companies, there is also a company (Yammer) that first used
--- a/progs/display/enfa.scala Fri Apr 28 11:01:25 2017 +0100
+++ b/progs/display/enfa.scala Sun May 07 00:20:58 2017 +0100
@@ -19,7 +19,7 @@
applyOrElse(delta, (q, None))
def enexts(qs: Set[A]) : Set[A] =
- qs | qs.flatMap(enext(_))
+ qs | qs.flatMap(enext(_)) // | is the set-union in Scala
// epsilon closure
def ecl(qs: Set[A]) : Set[A] =
--- a/progs/display/thompson1.scala Fri Apr 28 11:01:25 2017 +0100
+++ b/progs/display/thompson1.scala Sun May 07 00:20:58 2017 +0100
@@ -16,23 +16,23 @@
}
-// some types abbreviations
+// a type abbreviation
type NFAt = NFA[TState, Char]
-// NFA that does not accept any string
+// a NFA that does not accept any string
def NFA_ZERO(): NFAt = {
val Q = TState()
NFA(Set(Q), { case _ => Set() }, Set())
}
-// NFA that accepts the empty string
+// a NFA that accepts the empty string
def NFA_ONE() : NFAt = {
val Q = TState()
NFA(Set(Q), { case _ => Set() }, Set(Q))
}
-// NFA that accepts the string "c"
+// a NFA that accepts the string "c"
def NFA_CHAR(c: Char) : NFAt = {
val Q1 = TState()
val Q2 = TState()
--- a/progs/display/thompson2.scala Fri Apr 28 11:01:25 2017 +0100
+++ b/progs/display/thompson2.scala Sun May 07 00:20:58 2017 +0100
@@ -1,6 +1,6 @@
// Thompson Construction (Part 2)
-// some more types abbreviations
+// some more type abbreviations
type NFAtrans = (TState, Char) :=> Set[TState]
type eNFAtrans = (TState, Option[Char]) :=> Set[TState]
--- a/progs/enfa.scala Fri Apr 28 11:01:25 2017 +0100
+++ b/progs/enfa.scala Sun May 07 00:20:58 2017 +0100
@@ -19,7 +19,7 @@
applyOrElse(delta, (q, None))
def enexts(qs: Set[A]) : Set[A] =
- qs | qs.flatMap(enext(_))
+ qs | qs.flatMap(enext(_)) // | is the set-union in Scala
// epsilon closure
def ecl(qs: Set[A]) : Set[A] =
--- a/progs/nfa.scala Fri Apr 28 11:01:25 2017 +0100
+++ b/progs/nfa.scala Sun May 07 00:20:58 2017 +0100
@@ -79,3 +79,10 @@
nfa1.accepts2("aaaaabbb".toList) // true
nfa1.accepts2("aaaaabbbaaa".toList) // false
nfa1.accepts2("ac".toList) // false
+
+
+
+
+// subset constructions
+
+def subset(nfa: NFA[A, C]) : DFA[Set[A], C] =
--- a/progs/thompson.scala Fri Apr 28 11:01:25 2017 +0100
+++ b/progs/thompson.scala Sun May 07 00:20:58 2017 +0100
@@ -102,14 +102,6 @@
case STAR(r1) => NFA_STAR(thompson(r1))
}
-
-def tmatches(r: Rexp, s: String) : Boolean =
- thompson(r).accepts(s.toList)
-
-def tmatches2(r: Rexp, s: String) : Boolean =
- thompson(r).accepts2(s.toList)
-
-
//optional regular expression (one or zero times)
def OPT(r: Rexp) = ALT(r, ONE)
@@ -121,8 +113,16 @@
}
+def tmatches(r: Rexp, s: String) : Boolean =
+ thompson(r).accepts(s.toList)
+
+def tmatches2(r: Rexp, s: String) : Boolean =
+ thompson(r).accepts2(s.toList)
+
+
// Test Cases
+
// the evil regular expression a?{n} a{n}
def EVIL1(n: Int) = SEQ(NTIMES(OPT(CHAR('a')), n), NTIMES(CHAR('a'), n))
@@ -135,7 +135,7 @@
for (j <- 1 to i) code
val end = System.nanoTime()
(end - start)/(i * 1.0e9)
-
+}
// the size of the NFA can be large,
// thus slowing down the breadth-first search