--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/LINKS Tue Sep 26 14:08:49 2017 +0100
@@ -0,0 +1,80 @@
+assmebly (calculator RPN)
+
+https://dere.github.io/2017-02-12/beginners-assembly-part1/
+
+
+webassembly
+https://sourceware.org/ml/binutils/2017-03/msg00044.html
+https://hacks.mozilla.org/2017/02/a-cartoon-intro-to-webassembly/
+https://webassembly.github.io/spec/
+
+webassembly explorer
+https://mbebenita.github.io/WasmExplorer/
+
+ARM
+https://azeria-labs.com/writing-arm-assembly-part-1/
+
+JVM
+https://www.toptal.com/scala/scala-bytecode-and-the-jvm
+
+Growing a compiler
+http://www.cs.dartmouth.edu/~mckeeman/cs48/mxcom/gem/html/GrowingCompiler.html
+
+free books
+https://github.com/vhf/free-programming-books/blob/master/free-programming-books.md
+https://john.cs.olemiss.edu/~hcc/csci658/notes/Free_Prog_Lang_Textbooks.html
+
+MIPS
+http://courses.missouristate.edu/kenvollmar/mars/
+
+PEG
+
+https://github.com/taocpp/PEGTL
+
+
+Parser
+https://www.reddit.com/r/programming/comments/615hoz/how_to_write_a_recursive_descent_parser/
+https://www.reddit.com/r/ProgrammingLanguages/comments/60gmgc/writing_a_recursive_descent_expression_parser/
+http://www.craftinginterpreters.com/parsing-expressions.html
+
+
+small languages
+https://github.com/Michael2109/cobalt
+http://www.red-lang.org/p/about.html
+http://craftinginterpreters.com/contents.html
+https://michaelhaywoodblog.wordpress.com
+https://ruslanspivak.com/lsbasi-part1/
+http://selfie.cs.uni-salzburg.at
+
+
+automata
+https://www7.in.tum.de/um/courses/auto/ws1314/script/autonotes.pdf
+
+Reges helpers
+https://regex101.com
+http://www.regular-expressions.info/tutorial.html
+
+
+Regex performance benchmark
+https://rust-leipzig.github.io/regex/2017/03/28/comparison-of-regex-engines/
+https://github.com/k-takata/Onigmo
+
+Sulzmann
+https://github.com/pippijn/dreml/
+
+
+Scala parser
+http://enear.github.io/2016/03/31/parser-combinators/
+
+
+ANTLR megatutorial
+https://tomassetti.me/antlr-mega-tutorial/
+
+From regex to LLVM
+https://www.youtube.com/watch?v=Ukqb6nMjFyk
+
+
+
+
+Static code analysis
+https://medium.com/@Coder_HarryLee/videos-about-static-code-analysis-7654b40f9a3b
\ No newline at end of file
Binary file coursework/cw01.pdf has changed
--- a/coursework/cw01.tex Tue Sep 26 14:07:29 2017 +0100
+++ b/coursework/cw01.tex Tue Sep 26 14:08:49 2017 +0100
@@ -2,11 +2,15 @@
\usepackage{../style}
\usepackage{../langs}
+\usepackage{array}
+
+
\begin{document}
+\newcolumntype{C}[1]{>{\centering}m{#1}}
\section*{Coursework 1 (Strand 1)}
-This coursework is worth 4\% and is due on 25 October at
+This coursework is worth 4\% and is due on 19 October at
16:00. You are asked to implement a regular expression matcher
and submit a document containing the answers for the questions
below. You can do the implementation in any programming
@@ -25,7 +29,7 @@
uploaded to KEATS, which you can freely use.\bigskip
-\subsubsection*{Task}
+\subsection*{Task}
The task is to implement a regular expression matcher based on
derivatives of regular expressions. The implementation should
@@ -40,25 +44,41 @@
\begin{center}
\begin{tabular}{ll}
-$[c_1 c_2 \ldots c_n]$ & a range of characters\\
-$r^+$ & one or more times $r$\\
-$r^?$ & optional $r$\\
-$r^{\{n,m\}}$ & at least $n$-times $r$ but no more than $m$-times\\
-$\sim{}r$ & not-regular expression of $r$\\
+ $[c_1,c_2,\ldots,c_n]$ & a set of characters---for character ranges\\
+ $r^+$ & one or more times $r$\\
+ $r^?$ & optional $r$\\
+ $r^{\{n\}}$ & exactly $n$-times\\
+ $r^{\{..m\}}$ & zero or more times $r$ but no more than $m$-times\\
+ $r^{\{n..\}}$ & at least $n$-times $r$\\
+ $r^{\{n..m\}}$ & at least $n$-times $r$ but no more than $m$-times\\
+ $\sim{}r$ & not-regular-expression of $r$\\
\end{tabular}
\end{center}
-\noindent In the case of $r^{\{n,m\}}$ you can assume the
-convention that $0 \le n \le m$. The meanings of the extended
-regular expressions are
+\noindent You can assume that $n$ and $m$ are greater or equal than
+$0$. In the case of $r^{\{n,m\}}$ you can also assume $0 \le n \le m$.\bigskip
+
+\noindent {\bf Important!} Your implementation should have explicit
+cases for the basic regular expressions, but also for explicit cases for
+the extended regular expressions. That means do not treat the extended
+regular expressions by just translating them into the basic ones. See
+also Question 2, where you are asked to explicitly give the rules for
+\textit{nullable} and \textit{der} for the extended regular
+expressions.\newpage
+
+\noindent
+The meanings of the extended regular expressions are
\begin{center}
\begin{tabular}{r@{\hspace{2mm}}c@{\hspace{2mm}}l}
-$L([c_1 c_2 \ldots c_n])$ & $\dn$ & $\{[c_1], [c_2], \ldots, [c_n]\}$\\
-$L(r^+)$ & $\dn$ & $\bigcup_{1\le i}. L(r)^i$\\
-$L(r^?)$ & $\dn$ & $L(r) \cup \{[]\}$\\
-$L(r^{\{n,m\}})$ & $\dn$ & $\bigcup_{n\le i \le m}. L(r)^i$\\
-$L(\sim{}r)$ & $\dn$ & $\Sigma^* - L(r)$
+ $L([c_1,c_2,\ldots,c_n])$ & $\dn$ & $\{[c_1], [c_2], \ldots, [c_n]\}$\\
+ $L(r^+)$ & $\dn$ & $\bigcup_{1\le i}.\;L(r)^i$\\
+ $L(r^?)$ & $\dn$ & $L(r) \cup \{[]\}$\\
+ $L(r^{\{n\}})$ & $\dn$ & $L(r)^n$\\
+ $L(r^{\{..m\}})$ & $\dn$ & $\bigcup_{0\le i \le m}.\;L(r)^i$\\
+ $L(r^{\{n..\}})$ & $\dn$ & $\bigcup_{n\le i}.\;L(r)^i$\\
+ $L(r^{\{n..m\}})$ & $\dn$ & $\bigcup_{n\le i \le m}.\;L(r)^i$\\
+ $L(\sim{}r)$ & $\dn$ & $\Sigma^* - L(r)$
\end{tabular}
\end{center}
@@ -66,52 +86,56 @@
for the set of \emph{all} strings over the alphabet $\Sigma$
(in the implementation the alphabet can be just what is
represented by, say, the type \pcode{Char}). So $\sim{}r$
-means `all the strings that $r$ cannot match'.
+means in effect ``all the strings that $r$ cannot match''.\medskip
+\noindent
Be careful that your implementation of \textit{nullable} and
-\textit{der} satisfies for every $r$ the following two
-properties (see also Question 2):
+\textit{der} satisfies for every regular expression $r$ the following
+two properties (see also Question 2):
\begin{itemize}
\item $\textit{nullable}(r)$ if and only if $[]\in L(r)$
\item $L(der\,c\,r) = Der\,c\,(L(r))$
\end{itemize}
-\noindent {\bf Important!} Your implementation should have
-explicit cases for the basic regular expressions, but also
-explicit cases for the extended regular expressions. That
-means do not treat the extended regular expressions by just
-translating them into the basic ones. See also Question 2,
-where you are asked to explicitly give the rules for
-\textit{nullable} and \textit{der} for the extended regular
-expressions.
\subsection*{Question 1}
What is your King's email address (you will need it in
-Question 3)?
+Question 4)?
\subsection*{Question 2}
-This question does not require any implementation. From the
+From the
lectures you have seen the definitions for the functions
\textit{nullable} and \textit{der} for the basic regular
-expressions. Give the rules for the extended regular
+expressions. Implement the rules for the extended regular
expressions:
\begin{center}
\begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}}
-$\textit{nullable}([c_1 c_2 \ldots c_n])$ & $\dn$ & $?$\\
-$\textit{nullable}(r^+)$ & $\dn$ & $?$\\
-$\textit{nullable}(r^?)$ & $\dn$ & $?$\\
-$\textit{nullable}(r^{\{n,m\}})$ & $\dn$ & $?$\\
-$\textit{nullable}(\sim{}r)$ & $\dn$ & $?$\medskip\\
-$der\, c\, ([c_1 c_2 \ldots c_n])$ & $\dn$ & $?$\\
-$der\, c\, (r^+)$ & $\dn$ & $?$\\
-$der\, c\, (r^?)$ & $\dn$ & $?$\\
-$der\, c\, (r^{\{n,m\}})$ & $\dn$ & $?$\\
-$der\, c\, (\sim{}r)$ & $\dn$ & $?$\\
+ $\textit{nullable}([c_1,c_2,\ldots,c_n])$ & $\dn$ & $?$\\
+ $\textit{nullable}(r^+)$ & $\dn$ & $?$\\
+ $\textit{nullable}(r^?)$ & $\dn$ & $?$\\
+ $\textit{nullable}(r^{\{n\}})$ & $\dn$ & $?$\\
+ $\textit{nullable}(r^{\{..m\}})$ & $\dn$ & $?$\\
+ $\textit{nullable}(r^{\{n..\}})$ & $\dn$ & $?$\\
+ $\textit{nullable}(r^{\{n..m\}})$ & $\dn$ & $?$\\
+ $\textit{nullable}(\sim{}r)$ & $\dn$ & $?$
+\end{tabular}
+\end{center}
+
+\begin{center}
+\begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}}
+ $der\, c\, ([c_1,c_2,\ldots,c_n])$ & $\dn$ & $?$\\
+ $der\, c\, (r^+)$ & $\dn$ & $?$\\
+ $der\, c\, (r^?)$ & $\dn$ & $?$\\
+ $der\, c\, (r^{\{n\}})$ & $\dn$ & $?$\\
+ $der\, c\, (r^{\{..m\}})$ & $\dn$ & $?$\\
+ $der\, c\, (r^{\{n..\}})$ & $\dn$ & $?$\\
+ $der\, c\, (r^{\{n..m\}})$ & $\dn$ & $?$\\
+ $der\, c\, (\sim{}r)$ & $\dn$ & $?$\\
\end{tabular}
\end{center}
@@ -123,12 +147,80 @@
\item $L(der\,c\,r)) = Der\,c\,(L(r))$
\end{itemize}
+\noindent
+Given the definitions of \textit{nullable} and \textit{der}, it is
+easy to implement a regular expression matcher. Test your regular
+expression matcher with (at least) the examples:
+
+
+\begin{center}
+\def\arraystretch{1.2}
+\begin{tabular}{r|m{12mm}|m{12mm}|m{12mm}|m{12mm}|m{12mm}|m{12mm}}
+ string & $a^{\{3\}}$ & $(a^?)^{\{3\}}$ & $a^{\{..3\}}$ &
+ $(a^?)^{\{..3\}}$ & $a^{\{3..5\}}$ & $(a^?)^{\{3..5\}}$\\\hline
+ $[]$ &&&&&& \\\hline
+ \texttt{a} &&&&&& \\\hline
+ \texttt{aa} &&&&&& \\\hline
+ \texttt{aaa} &&&&&& \\\hline
+ \texttt{aaaaa} &&&&&& \\\hline
+ \texttt{aaaaaa}&&&&&& \\
+\end{tabular}
+\end{center}
+
+\noindent
+Does your matcher produce the expected results?
+
\subsection*{Question 3}
-Implement the following regular expression for email addresses
+As you can see, there are a number of explicit regular expressions
+that deal with single or several characters, for example:
+
+\begin{center}
+\begin{tabular}{ll}
+ $c$ & matches a single character\\
+ $[c_1,c_2,\ldots,c_n]$ & matches a set of characters---for character ranges\\
+ $\textit{ALL}$ & matches any character
+\end{tabular}
+\end{center}
+
+\noindent
+the latter is useful for matching any string (for example
+by using $\textit{ALL}^*$). In order to avoid having an explicit constructor
+for each case, we can generalise all these cases and introduce a single
+constructor $\textit{CFUN}(f)$ where $f$ is a function from characters
+to a boolean. The idea is that the function $f$ determines which character(s)
+are matched, namely those where $f$ returns \texttt{true}.
+In this question implement \textit{CFUN} and define
+
+\begin{center}
+\begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}}
+ $\textit{nullable}(\textit{CFUN}(f))$ & $\dn$ & $?$\\
+ $\textit{der}\,c\,(\textit{CFUN}(f))$ & $\dn$ & $?$
+\end{tabular}
+\end{center}
+
+\noindent in your matcher and then also give definitions for
+
+\begin{center}
+\begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}}
+ $c$ & $\dn$ & $\textit{CFUN}(?)$\\
+ $[c_1,c_2,\ldots,c_n]$ & $\dn$ & $\textit{CFUN}(?)$\\
+ $\textit{ALL}$ & $\dn$ & $\textit{CFUN}(?)$
+\end{tabular}
+\end{center}
+
+
+\subsection*{Question 4}
+
+Suppose $[a\mbox{-}z0\mbox{-}9\_\,.\mbox{-}]$ stands for the regular expression
+
+\[[a,b,c,\ldots,z,0,\dots,9,\_,.,\mbox{-}]\;.\]
+
+\noindent
+Define in your code the following regular expression for email addresses
\[
-([a\mbox{-}z0\mbox{-}9\_\!\_\,.-]^+)\cdot @\cdot ([a\mbox{-}z0\mbox{-}9\,.-]^+)\cdot .\cdot ([a\mbox{-}z\,.]^{\{2,6\}})
+([a\mbox{-}z0\mbox{-}9\_\,.-]^+)\cdot @\cdot ([a\mbox{-}z0\mbox{-}9\,.-]^+)\cdot .\cdot ([a\mbox{-}z\,.]^{\{2,6\}})
\]
\noindent and calculate the derivative according to your email
@@ -151,13 +243,12 @@
\noindent Write down your simplified derivative in a readable
notation using parentheses where necessary. That means you
should use the infix notation $+$, $\cdot$, $^*$ and so on,
-instead of code.
+instead of code.\bigskip
-\subsection*{Question 4}
-
-Suppose \textit{[a-z]} stands for the range regular expression
-$[a,b,c,\ldots,z]$. Consider the regular expression $/ \cdot * \cdot
-(\sim{}([a\mbox{-}z]^* \cdot * \cdot / \cdot [a\mbox{-}z]^*)) \cdot *
+\noindent
+Implement the simplification rules in your regular expression matcher.
+Consider the regular expression $/ \cdot * \cdot
+(\sim{}(\textit{ALL}^* \cdot * \cdot / \cdot \textit{ALL}^*)) \cdot *
\cdot /$ and decide wether the following four strings are matched by
this regular expression. Answer yes or no.
@@ -169,23 +260,7 @@
\end{enumerate}
\noindent
-Also test your regular expression matcher with the regular
-expression $a^{\{3,5\}}$ and the strings
-
-\begin{enumerate}
-\setcounter{enumi}{4}
-\item \texttt{aa}
-\item \texttt{aaa}
-\item \texttt{aaaaa}
-\item \texttt{aaaaaa}
-\end{enumerate}
-
-\noindent
-Does your matcher produce the expected results?
-
-\subsection*{Question 5}
-
-Let $r_1$ be the regular expression $a\cdot a\cdot a$ and $r_2$ be
+Also let $r_1$ be the regular expression $a\cdot a\cdot a$ and $r_2$ be
$(a^{\{19,19\}}) \cdot (a^?)$. Decide whether the following three
strings consisting of $a$s only can be matched by $(r_1^+)^+$.
Similarly test them with $(r_2^+)^+$. Again answer in all six cases
@@ -197,6 +272,7 @@
to not introducing any other character.
\begin{enumerate}
+\setcounter{enumi}{4}
\item \texttt{"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\\
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\\
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"}
@@ -209,6 +285,7 @@
\end{enumerate}
+
\end{document}
%%% Local Variables:
Binary file coursework/cw02.pdf has changed
--- a/coursework/cw02.tex Tue Sep 26 14:07:29 2017 +0100
+++ b/coursework/cw02.tex Tue Sep 26 14:08:49 2017 +0100
@@ -6,7 +6,7 @@
\section*{Coursework 2 (Strand 1)}
-\noindent This coursework is worth 5\% and is due on 4
+\noindent This coursework is worth 5\% and is due on 3
November at 16:00. You are asked to implement the Sulzmann \&
Lu lexer for the WHILE language. You can do the
implementation in any programming language you like, but you
@@ -88,7 +88,7 @@
\begin{center}
\begin{tabular}{ll}
-$[c_1 c_2 \ldots c_n]$ & a range of characters\\
+$[c_1,c_2,\ldots,c_n]$ & a set of characters\\
$r^+$ & one or more times $r$\\
$r^?$ & optional $r$\\
$r^{\{n\}}$ & n-times $r$\\
@@ -96,7 +96,7 @@
\end{center}
\noindent
-Later on you will also need the record regular expressions:
+Later on you will also need the record regular expression:
\begin{center}
\begin{tabular}{ll}
@@ -105,8 +105,9 @@
\end{center}
\noindent Try to design your regular expressions to be as
-small as possible. For example you should use character ranges
-for identifiers and numbers.
+small as possible. For example you should use character sets
+for identifiers and numbers. Feel free to use the general
+character constructor \textit{CFUN} introduced in CW 1.
\subsection*{Question 2}
@@ -119,11 +120,11 @@
\begin{center}
\begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}}
-$mkeps([c_1 c_2 \ldots c_n])$ & $\dn$ & $?$\\
+$mkeps([c_1,c_2,\ldots,c_n])$ & $\dn$ & $?$\\
$mkeps(r^+)$ & $\dn$ & $?$\\
$mkeps(r^?)$ & $\dn$ & $?$\\
$mkeps(r^{\{n\}})$ & $\dn$ & $?$\medskip\\
-$inj\, ([c_1 c_2 \ldots c_n])\,c\,\ldots$ & $\dn$ & $?$\\
+$inj\, ([c_1,c_2,\ldots,c_n])\,c\,\ldots$ & $\dn$ & $?$\\
$inj\, (r^+)\,c\,\ldots$ & $\dn$ & $?$\\
$inj\, (r^?)\,c\,\ldots$ & $\dn$ & $?$\\
$inj\, (r^{\{n\}})\,c\,\ldots$ & $\dn$ & $?$\\
@@ -175,16 +176,12 @@
\begin{figure}[p]
-\begin{center}
\mbox{\lstinputlisting[language=while]{../progs/fib.while}}
-\end{center}
\caption{Fibonacci program in the WHILE language.\label{fib}}
\end{figure}
\begin{figure}[p]
-\begin{center}
\mbox{\lstinputlisting[language=while]{../progs/loops.while}}
-\end{center}
\caption{The three-nested-loops program in the WHILE language.
Usually used for timing measurements.\label{loop}}
\end{figure}
Binary file coursework/cw03.pdf has changed
--- a/coursework/cw03.tex Tue Sep 26 14:07:29 2017 +0100
+++ b/coursework/cw03.tex Tue Sep 26 14:08:49 2017 +0100
@@ -6,7 +6,7 @@
\section*{Coursework 3}
-\noindent This coursework is worth 5\% and is due on 24
+\noindent This coursework is worth 5\% and is due on 23
November at 16:00. You are asked to implement a parser for the
WHILE language and also an interpreter. You can do the
implementation in any programming language you like, but you
@@ -117,7 +117,7 @@
how long does your interpreter take when \pcode{start}
is initialised with 100, 500 and so on. How far can
you scale this value if you are willing to wait, say
-1 Minute.
+1 Minute?
\begin{figure}[p]
\begin{center}
Binary file coursework/cw04.pdf has changed
--- a/coursework/cw04.tex Tue Sep 26 14:07:29 2017 +0100
+++ b/coursework/cw04.tex Tue Sep 26 14:08:49 2017 +0100
@@ -9,7 +9,7 @@
\section*{Coursework 4 (Strand 1)}
-\noindent This coursework is worth 6\% and is due on 13
+\noindent This coursework is worth 6\% and is due on 7
December at 16:00. You are asked to implement a compiler for
the WHILE language that targets the assembler language
provided by Jasmin or Krakatau (both have very similar
@@ -100,9 +100,10 @@
\end{center}
\noindent This assembler is largely compatible with the Jasmin
-syntax---that means for the files we look are concerned with,
+syntax---that means for the files we are concerned with here,
it understands the same input syntax (no changes to your
-compiler need to be made). You can generate Java Byte Code by
+compiler need to be made; ok maybe some small syntactic
+adjustments are needed). You can generate Java Byte Code by
using
\begin{center}
@@ -112,7 +113,8 @@
\noindent where you may have to adapt the directory where
Krakatau is installed (I just downloaded the zip file from
Github and \pcode{Krakatau-master} was the directory where it
-was installed).
+was installed). Again the resulting class-file you can run with
+\texttt{java}.
%\noindent You need to submit a document containing the answers
@@ -134,7 +136,8 @@
the console which Fibonacci number and which Factorial should
be calculated. The Fibonacci program is given in
Figure~\ref{fibs}. You can write your own program for
-calculating factorials.
+calculating factorials. Submit your assembler code as
+a file that can be run, not as PDF-text.
\begin{figure}[t]
\lstinputlisting[language=while]{../progs/fib.while}
Binary file coursework/cw05.pdf has changed
--- a/coursework/cw05.tex Tue Sep 26 14:07:29 2017 +0100
+++ b/coursework/cw05.tex Tue Sep 26 14:08:49 2017 +0100
@@ -6,7 +6,7 @@
\section*{Coursework (Strand 2)}
-\noindent This coursework is worth 20\% and is due on 13 December at
+\noindent This coursework is worth 20\% and is due on 7 December at
16:00. You are asked to prove the correctness of the regular
expression matcher from the lectures using the Isabelle theorem
prover. You need to submit a theory file containing this proof. The
--- a/data.sty Tue Sep 26 14:07:29 2017 +0100
+++ b/data.sty Tue Sep 26 14:08:49 2017 +0100
@@ -142,18 +142,18 @@
%% re2.scala example a?{n} a{n}
\begin{filecontents}{re2.data}
-1 0.00006
-101 0.00605
-201 0.04343
-301 0.16469
-401 0.41954
-501 0.83703
-601 1.66925
-701 2.71086
-801 4.19745
-901 7.56495
-1001 12.42103
-1101 20.41763
+1 0.00050
+101 0.02030
+201 0.10587
+301 0.31188
+401 0.32794
+501 0.64490
+601 1.16738
+701 2.10815
+801 3.47144
+901 6.80621
+1001 12.35611
+1101 23.80084
\end{filecontents}
%% re2.scala: example (a*)* b
@@ -182,36 +182,50 @@
%% re3.scala: example a?{n} a{n}
\begin{filecontents}{re3.data}
-1 0.00005
-1001 0.63505
-2001 2.53029
-3001 5.72804
-4001 9.94246
-5001 15.52770
-6001 22.44126
-7001 30.86867
-8001 39.32242
-9001 48.96998
+1 0.00003
+1001 0.03887
+2001 0.15666
+3001 0.35910
+4001 0.63950
+5001 1.00241
+6001 1.50480
+7001 2.11568
+8001 2.71208
+9001 3.41157
+10001 4.19962
+11001 5.70387
\end{filecontents}
%% re3.scala: example (a*)* b
\begin{filecontents}{re3a.data}
-1 0.00014
-500001 2.61059
-1000001 5.42773
-1500001 8.02603
-2000001 10.49844
-2500001 13.34234
-3000001 16.17491
-3500001 19.11650
-4000001 21.66151
-4500001 24.85496
-5000001 28.52113
-5500001 28.54548
-6000001 32.39523
-6500001 45.13486
-7000001 54.15018
-7500001 71.32218
+1 0.00003
+500001 0.22527
+1000001 0.62752
+1500001 0.88485
+2000001 1.39815
+2500001 1.68619
+3000001 1.94957
+3500001 2.15878
+4000001 2.59918
+4500001 5.90679
+5000001 13.11295
+5500001 19.15376
+6000001 40.16373
+\end{filecontents}
+\begin{filecontents}{re3b.data}
+1 0.00015
+500001 0.28337
+1000001 0.53271
+1500001 0.84478
+2000001 1.11763
+2500001 1.76656
+3000001 2.13310
+3500001 2.39576
+4000001 2.98624
+4500001 5.96529
+5000001 13.56911
+5500001 18.43089
+6000001 40.33704
\end{filecontents}
%% re4.scala example a?{n} a{n}
Binary file handouts/graphs.pdf has changed
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/handouts/graphs.tex Tue Sep 26 14:08:49 2017 +0100
@@ -0,0 +1,130 @@
+\documentclass{article}
+\usepackage{../style}
+\usepackage{../langs}
+\usepackage{../graphics}
+\usepackage{../data}
+
+
+\begin{document}
+
+
+\section*{Benchmarks for $(a^*)^* b$ and $a^{?\{n\}} a^{\{n\}}$}
+
+\mbox{}\bigskip
+
+\begin{center}
+$(a^*)^* \cdot b$ and strings $\underbrace{a\ldots a}_{n}$\medskip\\
+\begin{tabular}{@{}cc@{}}
+\raisebox{5mm}{
+\begin{tikzpicture}
+\begin{axis}[
+ xlabel={$n$},
+ x label style={at={(1.05,0.0)}},
+ ylabel={time in secs},
+ enlargelimits=false,
+ xtick={0,5,...,30},
+ xmax=33,
+ ymax=35,
+ ytick={0,5,...,30},
+ scaled ticks=false,
+ axis lines=left,
+ width=5cm,
+ height=5cm,
+ legend entries={Java, Python},
+ legend pos=north west,
+ legend cell align=left]
+\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
+\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
+\end{axis}
+\end{tikzpicture}}
+&
+\begin{tikzpicture}
+ \begin{axis}[
+ xlabel={$n$},
+ x label style={at={(1.1,0.0)}},
+ %%xtick={0,1000000,...,5000000},
+ ylabel={time in secs},
+ enlargelimits=false,
+ ymax=35,
+ ytick={0,5,...,30},
+ axis lines=left,
+ %scaled ticks=false,
+ width=6.5cm,
+ height=5cm,
+ legend entries={Derivative matcher},
+ legend pos=north east,
+ legend cell align=left]
+%\addplot[green,mark=square*,mark options={fill=white}] table {re2a.data};
+\addplot[black,mark=square*,mark options={fill=white}] table {re3a.data};
+\end{axis}
+\end{tikzpicture}
+\end{tabular}
+\end{center}\bigskip
+
+\begin{center}
+$a^{?\{n\}} \cdot a^{\{n\}}$ and strings $\underbrace{a\ldots a}_{n}$\medskip\\
+\begin{tabular}{@{}cc@{}}
+\begin{tikzpicture}
+\begin{axis}[
+ xlabel={$n$},
+ x label style={at={(1.05,0.0)}},
+ ylabel={\small time in secs},
+ enlargelimits=false,
+ xtick={0,5,...,30},
+ xmax=33,
+ ymax=35,
+ ytick={0,5,...,30},
+ scaled ticks=false,
+ axis lines=left,
+ width=5cm,
+ height=5cm,
+ legend entries={Python,Ruby},
+ legend pos=north west,
+ legend cell align=left]
+\addplot[blue,mark=*, mark options={fill=white}] table {re-python.data};
+\addplot[brown,mark=triangle*, mark options={fill=white}] table {re-ruby.data};
+\end{axis}
+\end{tikzpicture}
+&
+\begin{tikzpicture}
+ \begin{axis}[
+ xlabel={$n$},
+ x label style={at={(1.1,0.05)}},
+ ylabel={\small time in secs},
+ enlargelimits=false,
+ xtick={0,2500,...,11000},
+ xmax=12000,
+ ymax=35,
+ ytick={0,5,...,30},
+ scaled ticks=false,
+ axis lines=left,
+ width=6.5cm,
+ height=5cm,
+ legend entries={Derivative matcher},
+ legend pos=north east,
+ legend cell align=left]
+%\addplot[green,mark=square*,mark options={fill=white}] table {re2.data};
+\addplot[black,mark=square*,mark options={fill=white}] table {re3.data};
+\end{axis}
+\end{tikzpicture}
+\end{tabular}
+\end{center}
+
+
+\subsubsection*{Sources}
+
+
+\url{http://talisker.inf.kcl.ac.uk/cgi-bin/repos.cgi/afl-material/raw-file/tip/progs/catastrophic.java}\medskip
+
+\noindent
+\url{http://talisker.inf.kcl.ac.uk/cgi-bin/repos.cgi/afl-material/raw-file/tip/progs/catastrophic.py}
+
+\end{document}
+
+
+
+
+%%% Local Variables:
+%%% mode: latex
+%%% TeX-master: t
+%%% End:
Binary file handouts/ho01.pdf has changed
--- a/handouts/ho01.tex Tue Sep 26 14:07:29 2017 +0100
+++ b/handouts/ho01.tex Tue Sep 26 14:08:49 2017 +0100
@@ -3,6 +3,8 @@
\usepackage{../langs}
\usepackage{../graphics}
\usepackage{../data}
+\usepackage{lstlinebgrd}
+\definecolor{capri}{rgb}{0.0, 0.75, 1.0}
%%http://regexcrossword.com/challenges/cities/puzzles/1
%%https://jex.im/regulex/
@@ -16,6 +18,8 @@
%% email validator
%% http://www.ex-parrot.com/%7Epdw/Mail-RFC822-Address.html
+% https://jackfoxy.github.io/FsRegEx/emailregex.html
+
%% regex testers
% https://regex101.com
@@ -37,22 +41,24 @@
\section*{Handout 1}
This module is about text processing, be it for web-crawlers,
-compilers, dictionaries, DNA-data and so on. When looking for a
+compilers, dictionaries, DNA-data, ad filters and so on. When looking for a
particular string, like $abc$ in a large text we can use the
Knuth-Morris-Pratt algorithm, which is currently the most efficient
general string search algorithm. But often we do \emph{not} just look
-for a particular string, but for string patterns. For example in
+for a particular string, but for string patterns. For example, in
program code we need to identify what are the keywords (if, then,
-while, etc), what are the identifiers (variable names). A pattern for
+while, for, etc), what are the identifiers (variable names). A pattern for
identifiers could be stated as: they start with a letter, followed by
-zero or more letters, numbers and underscores. Also often we face the
+zero or more letters, numbers and underscores. Often we also face the
problem that we are given a string (for example some user input) and
want to know whether it matches a particular pattern---be it an email
address, for example. In this way we can exclude user input that would
otherwise have nasty effects on our program (crashing it or making it
-go into an infinite loop, if not worse). The point is that the fast
-Knuth-Morris-Pratt algorithm for strings is not good enough for such
-string patterns.\smallskip
+go into an infinite loop, if not worse). In tools like Snort, scanning
+for computer viruses or filtering out spam usually involves scanning
+for some signature (essentially a string pattern). The point is that
+the fast Knuth-Morris-Pratt algorithm for strings is not good enough
+for such string \emph{patterns}.\smallskip
\defn{Regular expressions} help with conveniently specifying
such patterns. The idea behind regular expressions is that
@@ -188,7 +194,31 @@
matching in Python, Ruby and Java.
\begin{center}
-\begin{tabular}{@{\hspace{-1mm}}c@{\hspace{-1mm}}c@{}}
+\begin{tabular}{@{\hspace{-1mm}}c@{\hspace{1mm}}c@{}}
+\begin{tikzpicture}
+\begin{axis}[
+ title={Graph: $\texttt{(a*)*\,b}$ and strings
+ $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
+ xlabel={$n$},
+ x label style={at={(1.05,0.0)}},
+ ylabel={time in secs},
+ enlargelimits=false,
+ xtick={0,5,...,30},
+ xmax=33,
+ ymax=35,
+ ytick={0,5,...,30},
+ scaled ticks=false,
+ axis lines=left,
+ width=5.5cm,
+ height=4.5cm,
+ legend entries={Python, Java},
+ legend pos=north west,
+ legend cell align=left]
+\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
+\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
+\end{axis}
+\end{tikzpicture}
+&
\begin{tikzpicture}
\begin{axis}[
title={Graph: $\texttt{a?\{n\}\,a{\{n\}}}$ and strings
@@ -212,47 +242,24 @@
\addplot[brown,mark=triangle*, mark options={fill=white}] table {re-ruby.data};
\end{axis}
\end{tikzpicture}
-&
-\begin{tikzpicture}
-\begin{axis}[
- title={Graph: $\texttt{(a*)*\,b}$ and strings
- $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
- xlabel={$n$},
- x label style={at={(1.05,0.0)}},
- ylabel={time in secs},
- enlargelimits=false,
- xtick={0,5,...,30},
- xmax=33,
- ymax=35,
- ytick={0,5,...,30},
- scaled ticks=false,
- axis lines=left,
- width=5.5cm,
- height=4.5cm,
- legend entries={Python, Java},
- legend pos=north west,
- legend cell align=left]
-\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
-\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
-\end{axis}
-\end{tikzpicture}
\end{tabular}
\end{center}
-\noindent This first graph shows that Python needs approximately 29
-seconds for finding out whether a string of 28 \texttt{a}s matches the
-regular expression \texttt{a?\{28\}\,a\{28\}}. Ruby is even slightly
+\noindent This first graph shows that Python and Java need
+approximately 30 seconds to find out that the regular expression
+$\texttt{(a*)*\,b}$ does not match strings of 28 \texttt{a}s.
+Similarly, the second shows that Python needs approximately 29 seconds
+for finding out whether a string of 28 \texttt{a}s matches the regular
+expression \texttt{a?\{28\}\,a\{28\}}. Ruby is even slightly
worse.\footnote{In this example Ruby uses the slightly different
regular expression \texttt{a?a?a?...a?a?aaa...aa}, where the
\texttt{a?} and \texttt{a} each occur $n$ times. More such test
cases can be found at \url{http://www.computerbytesman.com/redos/}.}
-Simlarly, Python and Java needs approximately 30 seconds to find out
-that the regular expression $\texttt{(a*)*\,b}$ does not match strings
-of 28 \texttt{a}s. Admittedly, these regular expressions are
-carefully chosen to exhibit this exponential behaviour, but similar
-ones occur more often than one wants in ``real life''. For example, on
-20 July 2016 a similar regular expression brought the webpage
-\href{http://stackexchange.com}{Stack Exchange} to its knees:
+Admittedly, these regular expressions are carefully chosen to exhibit
+this exponential behaviour, but similar ones occur more often than one
+wants in ``real life''. For example, on 20 July 2016 a similar regular
+expression brought the webpage \href{http://stackexchange.com}{Stack
+ Exchange} to its knees:
\begin{center}
\url{http://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}
@@ -291,9 +298,9 @@
at a relatively simple algorithm that solves this problem much
better than Python and Ruby do\ldots actually it will be two
versions of the algorithm: the first one will be able to
-process strings of approximately 1,000 \texttt{a}s in 30
+process strings of approximately 1,100 \texttt{a}s in 23
seconds, while the second version will even be able to process
-up to 7,000(!) in 30 seconds, see the graph below:
+up to 11,000(!) in 5 seconds, see the graph below:
\begin{center}
\begin{tikzpicture}
@@ -304,8 +311,8 @@
x label style={at={(1.05,0.0)}},
ylabel={time in secs},
enlargelimits=false,
- xtick={0,3000,...,9000},
- xmax=10000,
+ xtick={0,3000,...,12000},
+ xmax=13000,
ymax=32,
ytick={0,5,...,30},
scaled ticks=false,
@@ -531,7 +538,7 @@
precisely specify when a string $s$ is matched by a regular
expression $r$, namely if and only if $s \in L(r)$. In fact we
will write a program \pcode{match} that takes any string $s$
-and any regular expression $r$ as argument and returns
+and any regular expression $r$ as arguments and returns
\emph{yes}, if $s \in L(r)$ and \emph{no}, if $s \not\in
L(r)$. We leave this for the next lecture.
@@ -636,9 +643,9 @@
Python, Ruby and Java in some instances and the problems in Stack
Exchange and the Atom editor). People who are not very familiar with
the mathematical background of regular expressions get them
-consistently wrong (surprising given they are a supposed to be core
-skill for computer scientists). The hope is that we can do better in
-the future---for example by proving that the algorithms actually
+consistently wrong (this is surprising given they are a supposed to be
+core skill for computer scientists). The hope is that we can do better
+in the future---for example by proving that the algorithms actually
satisfy their specification and that the corresponding implementations
do not contain any bugs. We are close, but not yet quite there.
@@ -647,15 +654,15 @@
``theoretical'' shortcomings, for example recognising strings of the
form $a^{n}b^{n}$ is not possible with regular expressions. This means
for example if we try to regognise whether parentheses are well-nested
-is impossible with (basic) regular expressions. I am not so bothered
-by these shortcomings. What I am bothered about is when regular
-expressions are in the way of practical programming. For example, it
-turns out that the regular expression for email addresses shown in
-\eqref{email} is hopelessly inadequate for recognising all of them
-(despite being touted as something every computer scientist should
-know about). The W3 Consortium (which standardises the Web) proposes
-to use the following, already more complicated regular expressions for
-email addresses:
+in an expression is impossible with (basic) regular expressions. I am
+not so bothered by these shortcomings. What I am bothered about is
+when regular expressions are in the way of practical programming. For
+example, it turns out that the regular expression for email addresses
+shown in \eqref{email} is hopelessly inadequate for recognising all of
+them (despite being touted as something every computer scientist
+should know about). The W3 Consortium (which standardises the Web)
+proposes to use the following, already more complicated regular
+expressions for email addresses:
{\small\begin{lstlisting}[language={},keywordstyle=\color{black},numbers=none]
[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*
@@ -682,11 +689,21 @@
addresses. Still it is good to know that some tasks in text
processing just cannot be achieved by using regular
expressions. But for what we want to use them (lexing) they are
-pretty good.
+pretty good.\medskip
+
+\noindent
+Finally there is a joke about regular expressions:
+
+\begin{quote}\it
+ ``Sometimes you have a programming problem and it seems like the
+ best solution is to use regular expressions; now you have two
+ problems.''
+\end{quote}
-\begin{figure}[p]
-\lstinputlisting{../progs/crawler1.scala}
+\begin{figure}[p]\small
+ \lstinputlisting[linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}]
+ {../progs/crawler1.scala}
\caption{The Scala code for a simple web-crawler that checks
for broken links in a web-page. It uses the regular expression
@@ -697,8 +714,10 @@
\end{figure}
-\begin{figure}[p]
-\lstinputlisting{../progs/crawler2.scala}
+
+\begin{figure}[p]\small
+ \lstinputlisting[linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}]
+ {../progs/crawler2.scala}
\caption{A version of the web-crawler that only follows links
in ``my'' domain---since these are the ones I am interested in
@@ -711,8 +730,9 @@
\end{figure}
-\begin{figure}[p]
-\lstinputlisting{../progs/crawler3.scala}
+\begin{figure}[p]\small
+ \lstinputlisting[linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}]
+ {../progs/crawler3.scala}
\caption{A small email harvester---whenever we download a
web-page, we also check whether it contains any email
Binary file handouts/ho02.pdf has changed
--- a/handouts/ho02.tex Tue Sep 26 14:07:29 2017 +0100
+++ b/handouts/ho02.tex Tue Sep 26 14:08:49 2017 +0100
@@ -12,16 +12,65 @@
\section*{Handout 2 (Regular Expression Matching)}
This lecture is about implementing a more efficient regular expression
-matcher (the plots on the right)---more efficient than the matchers
-from regular expression libraries in Ruby, Python and Java (the plots
-on the left). The first pair of plots show the running time for the
-regular expressions $a^?{}^{\{n\}}\cdot a^{\{n\}}$ and strings composed
-of $n$ \pcode{a}s. The second pair of plots show the running time
-for the regular expression $(a^*)^*\cdot b$ and also strings composed
-of $n$ \pcode{a}s (meaning this regular expression actually does not
+matcher (the plots on the right below)---more efficient than the
+matchers from regular expression libraries in Ruby, Python and Java
+(the plots on the left). The first pair of plots shows the running time
+for the regular expression $(a^*)^*\cdot b$ and strings composed of
+$n$ \pcode{a}s (meaning this regular expression actually does not
+match the strings). The second pair of plots shows the running time for
+the regular expressions $a^?{}^{\{n\}}\cdot a^{\{n\}}$ and strings
+also composed of $n$ \pcode{a}s (this time the regular expressions
match the strings). To see the substantial differences in the left
-and right plots below, note the different scales of the $x$-axes.
+and right plots below, note the different scales of the $x$-axes.
+
+\begin{center}
+Graphs: $(a^*)^* \cdot b$ and strings $\underbrace{a\ldots a}_{n}$
+\begin{tabular}{@{}cc@{}}
+\begin{tikzpicture}
+\begin{axis}[
+ xlabel={$n$},
+ x label style={at={(1.05,0.0)}},
+ ylabel={time in secs},
+ enlargelimits=false,
+ xtick={0,5,...,30},
+ xmax=33,
+ ymax=35,
+ ytick={0,5,...,30},
+ scaled ticks=false,
+ axis lines=left,
+ width=5cm,
+ height=5cm,
+ legend entries={Java, Python},
+ legend pos=north west,
+ legend cell align=left]
+\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
+\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
+\end{axis}
+\end{tikzpicture}
+&
+\begin{tikzpicture}
+ \begin{axis}[
+ xlabel={$n$},
+ x label style={at={(1.1,0.0)}},
+ %%xtick={0,1000000,...,5000000},
+ ylabel={time in secs},
+ enlargelimits=false,
+ ymax=35,
+ ytick={0,5,...,30},
+ axis lines=left,
+ %scaled ticks=false,
+ width=6.5cm,
+ height=5cm,
+ legend entries={Our matcher},
+ legend pos=north east,
+ legend cell align=left]
+%\addplot[green,mark=square*,mark options={fill=white}] table {re2a.data};
+\addplot[black,mark=square*,mark options={fill=white}] table {re3a.data};
+\end{axis}
+\end{tikzpicture}
+\end{tabular}
+\end{center}\bigskip
\begin{center}
Graphs: $a^{?\{n\}} \cdot a^{\{n\}}$ and strings $\underbrace{a\ldots a}_{n}$\\
@@ -54,68 +103,33 @@
x label style={at={(1.1,0.05)}},
ylabel={\small time in secs},
enlargelimits=false,
- xtick={0,3000,...,9000},
- xmax=10000,
+ xtick={0,2500,...,11000},
+ xmax=12000,
ymax=35,
ytick={0,5,...,30},
scaled ticks=false,
axis lines=left,
width=6.5cm,
- height=5cm]
-\addplot[green,mark=square*,mark options={fill=white}] table {re2.data};
+ height=5cm,
+ legend entries={Our matcher},
+ legend pos=north east,
+ legend cell align=left]
+%\addplot[green,mark=square*,mark options={fill=white}] table {re2.data};
\addplot[black,mark=square*,mark options={fill=white}] table {re3.data};
\end{axis}
\end{tikzpicture}
\end{tabular}
\end{center}
-
-\begin{center}
-Graphs: $(a^*)^* \cdot b$ and strings $\underbrace{a\ldots a}_{n}$
-\begin{tabular}{@{}cc@{}}
-\begin{tikzpicture}
-\begin{axis}[
- xlabel={$n$},
- x label style={at={(1.05,0.0)}},
- ylabel={time in secs},
- enlargelimits=false,
- xtick={0,5,...,30},
- xmax=33,
- ymax=35,
- ytick={0,5,...,30},
- scaled ticks=false,
- axis lines=left,
- width=5cm,
- height=5cm,
- legend entries={Java},
- legend pos=north west,
- legend cell align=left]
-\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
-\end{axis}
-\end{tikzpicture}
-&
-\begin{tikzpicture}
- \begin{axis}[
- xlabel={$n$},
- x label style={at={(1.05,0.0)}},
- ylabel={time in secs},
- enlargelimits=false,
- ymax=35,
- ytick={0,5,...,30},
- axis lines=left,
- scaled ticks=false,
- width=6.5cm,
- height=5cm]
-\addplot[green,mark=square*,mark options={fill=white}] table {re2a.data};
-\addplot[black,mark=square*,mark options={fill=white}] table {re3a.data};
-\end{axis}
-\end{tikzpicture}
-\end{tabular}
-\end{center}\medskip
+\bigskip
\noindent
-We will use these regular expressions and strings
-as running examples.
+In what follows we will use these regular expressions and strings as
+running examples. There will be several versions (V1, V2, V3,\ldots)
+of our matcher.\footnote{The corresponding files are
+ \texttt{re1.scala}, \texttt{re2.scala} and so on. As usual, you can
+ find the code on KEATS.}\bigskip
+\noindent
Having specified in the previous lecture what
problem our regular expression matcher is supposed to solve,
namely for any given regular expression $r$ and string $s$
@@ -125,7 +139,7 @@
s \in L(r)
\]
-\noindent we can look at an algorithm to solve this problem. Clearly
+\noindent we can look for an algorithm to solve this problem. Clearly
we cannot use the function $L$ directly for this, because in general
the set of strings $L$ returns is infinite (recall what $L(a^*)$ is).
In such cases there is no way we can implement an exhaustive test for
@@ -224,13 +238,14 @@
\end{equation}
\noindent If we can find an equivalent regular expression that is
-simpler (smaller for example), then this might potentially make our
-matching algorithm run faster. We can look for such a simpler regular
-expression $r'$ because whether a string $s$ is in $L(r)$ or in
-$L(r')$ with $r\equiv r'$ will always give the same answer. In the
-example above you will see that the regular expression is equivalent
-to just $r_1$. You can verify this by iteratively applying the
-simplification rules from above:
+simpler (that usually means smaller), then this might potentially make
+our matching algorithm run faster. We can look for such a simpler
+regular expression $r'$ because whether a string $s$ is in $L(r)$ or
+in $L(r')$ with $r\equiv r'$ will always give the same answer. Yes?
+
+In the example above you will see that the regular expression is
+equivalent to just $r_1$. You can verify this by iteratively applying
+the simplification rules from above:
\begin{center}
\begin{tabular}{ll}
@@ -251,6 +266,24 @@
$\ZERO$s, therefore simplifying them away will make the
algorithm quite a bit faster.
+Finally here are three equivalences between regular expressions which are
+not so obvious:
+
+\begin{center}
+\begin{tabular}{rcl}
+$r^*$ & $\equiv$ & $1 + r\cdot r^*$\\
+$(r_1 + r_2)^*$ & $\equiv$ & $r_1^* \cdot (r_2\cdot r_1^*)^*$\\
+$(r_1 \cdot r_2)^*$ & $\equiv$ & $1 + r_1\cdot (r_2 \cdot r_1)^* \cdot r_2$\\
+\end{tabular}
+\end{center}
+
+\noindent
+We will not use them in our algorithm, but feel free to convince you
+that they hold. As an aside, there has been a lot of research about
+questions like: Can one always decide when two regular expressions are
+equivalent or not? What does an algorithm look like to decide this
+efficiently?
+
\subsection*{The Matching Algorithm}
The algorithm we will define below consists of two parts. One
@@ -286,10 +319,10 @@
\emph{derivative} of a regular expression. This is a function
which will take a regular expression, say $r$, and a
character, say $c$, as arguments and returns a new regular
-expression. Be careful that the intuition behind this function
+expression. Be mindful that the intuition behind this function
is not so easy to grasp on first reading. Essentially this
function solves the following problem: if $r$ can match a
-string of the form $c\!::\!s$, what does the regular
+string of the form $c\!::\!s$, what does a regular
expression look like that can match just $s$? The definition
of this function is as follows:
@@ -343,9 +376,9 @@
and ``append'' $r^*$ in order to match the rest of $s$. Still
makes sense?
-If all this did not make sense yet, here is another way to rationalise
-the definition of $\textit{der}$ by considering the following operation
-on sets:
+If all this did not make sense yet, here is another way to explain the
+definition of $\textit{der}$ by considering the following operation on
+sets:
\begin{equation}\label{Der}
\textit{Der}\,c\,A\;\dn\;\{s\,|\,c\!::\!s \in A\}
@@ -425,7 +458,7 @@
\end{center}
\noindent This function iterates $\textit{der}$ taking one character at
-the time from the original string until it is exhausted.
+the time from the original string until the string is exhausted.
Having $\textit{der}s$ in place, we can finally define our matching
algorithm:
@@ -461,12 +494,16 @@
for \pcode{matches} are shown in Figure~\ref{scala1}.
\begin{figure}[p]
-\lstinputlisting{../progs/app5.scala}
-\caption{Scala implementation of the \textit{nullable} and
+ \lstinputlisting[numbers=left,linebackgroundcolor=
+ {\ifodd\value{lstnumber}\color{capri!3}\fi}]
+ {../progs/app5.scala}
+\caption{A Scala implementation of the \textit{nullable} and
derivative functions. These functions are easy to
- implement in functional languages, because their built-in pattern
+ implement in functional languages. This is because pattern
matching and recursion allow us to mimic the mathematical
- definitions very closely.\label{scala1}}
+ definitions very closely. Nearly all functional
+ programming languages support pattern matching and
+ recursion out of the box.\label{scala1}}
\end{figure}
@@ -507,8 +544,8 @@
For running the algorithm with our first example, the evil
regular expression $a^?{}^{\{n\}}a^{\{n\}}$, we need to implement
-the optional regular expression and the exactly $n$-times
-regular expression. This can be done with the translations
+the optional regular expression and the `exactly $n$-times
+regular expression'. This can be done with the translations
\lstinputlisting[numbers=none]{../progs/app51.scala}
@@ -541,8 +578,8 @@
\end{tikzpicture}
\end{center}
-\noindent Analysing this failure we notice that for
-$a^{\{n\}}$ we generate quite big regular expressions:
+\noindent Analysing this failure we notice that for $a^{\{n\}}$, for
+example, we generate quite big regular expressions:
\begin{center}
\begin{tabular}{rl}
@@ -560,18 +597,18 @@
large regular expressions will cause problems. This problem
is aggravated by $a^?$ being represented as $a + \ONE$.
-We can however fix this by having an explicit constructor for
+We can however fix this easily by having an explicit constructor for
$r^{\{n\}}$. In Scala we would introduce a constructor like
\begin{center}
\code{case class NTIMES(r: Rexp, n: Int) extends Rexp}
\end{center}
-\noindent With this fix we have a constant ``size'' regular
-expression for our running example no matter how large $n$ is.
-This means we have to also add cases for \pcode{NTIMES} in the
-functions $\textit{nullable}$ and $\textit{der}$. Does the change have any
-effect?
+\noindent With this fix we have a constant ``size'' regular expression
+for our running example no matter how large $n$ is (see the
+\texttt{size} section in the implementations). This means we have to
+also add cases for \pcode{NTIMES} in the functions $\textit{nullable}$
+and $\textit{der}$. Does the change have any effect?
\begin{center}
\begin{tikzpicture}
@@ -581,8 +618,8 @@
x label style={at={(1.01,0.0)}},
ylabel={time in secs},
enlargelimits=false,
- xtick={0,100,...,1000},
- xmax=1100,
+ xtick={0,200,...,1100},
+ xmax=1200,
ytick={0,5,...,30},
scaled ticks=false,
axis lines=left,
@@ -599,12 +636,12 @@
\end{tikzpicture}
\end{center}
-\noindent Now we are talking business! The modified matcher
-can within 30 seconds handle regular expressions up to
-$n = 950$ before a StackOverflow is raised. Recall that Python and Ruby
-(and our first version, Scala V1) could only handle $n = 27$ or so in 30
-seconds. There is no change for our second example
-$(a^*)^* \cdot b$---so this is still good.
+\noindent Now we are talking business! The modified matcher can within
+25 seconds handle regular expressions up to $n = 1,100$ before a
+StackOverflow is raised. Recall that Python and Ruby (and our first
+version, Scala V1) could only handle $n = 27$ or so in 30
+seconds. We have not tried our algorithm on the second example $(a^*)^* \cdot
+b$---but it is doing OK with it.
The moral is that our algorithm is rather sensitive to the
@@ -614,7 +651,7 @@
however, one more issue for making the algorithm run faster.
The derivative function often produces ``useless''
$\ZERO$s and $\ONE$s. To see this, consider $r = ((a
-\cdot b) + b)^*$ and the following two derivatives
+\cdot b) + b)^*$ and the following three derivatives
\begin{center}
\begin{tabular}{l}
@@ -625,9 +662,9 @@
\end{center}
\noindent
-If we simplify them according to the simple rules from the
-beginning, we can replace the right-hand sides by the
-smaller equivalent regular expressions
+If we simplify them according to the simplification rules from the
+beginning, we can replace the right-hand sides by the smaller
+equivalent regular expressions
\begin{center}
\begin{tabular}{l}
@@ -638,21 +675,23 @@
\end{center}
\noindent I leave it to you to contemplate whether such a
-simplification can have any impact on the correctness of our
-algorithm (will it change any answers?). Figure~\ref{scala2}
-gives a simplification function that recursively traverses a
-regular expression and simplifies it according to the rules
-given at the beginning. There are only rules for $+$, $\cdot$
-and $n$-times (the latter because we added it in the second
-version of our matcher). There is no rule for a star, because
-empirical data and also a little thought showed that
-simplifying under a star is a waste of computation time. The
-simplification function will be called after every derivation.
-This additional step removes all the ``junk'' the derivative
-function introduced. Does this improve the speed? You bet!!
+simplification can have any impact on the correctness of our algorithm
+(will it change any answers?). Figure~\ref{scala2} gives a
+simplification function that recursively traverses a regular
+expression and simplifies it according to the rules given at the
+beginning. There are only rules for $+$, $\cdot$ and $n$-times (the
+latter because we added it in the second version of our
+matcher). There is no simplification rule for a star, because
+empirical data and also a little thought showed that simplifying under
+a star is a waste of computation time. The simplification function
+will be called after every derivation. This additional step removes
+all the ``junk'' the derivative function introduced. Does this improve
+the speed? You bet!!
\begin{figure}[p]
-\lstinputlisting{../progs/app6.scala}
+\lstinputlisting[numbers=left,linebackgroundcolor=
+ {\ifodd\value{lstnumber}\color{capri!3}\fi}]
+ {../progs/app6.scala}
\caption{The simplification function and modified
\texttt{ders}-function; this function now
calls \texttt{der} first, but then simplifies
@@ -669,8 +708,8 @@
x label style={at={(1.04,0.0)}},
ylabel={time in secs},
enlargelimits=false,
- xtick={0,3000,...,9000},
- xmax=10000,
+ xtick={0,2500,...,10000},
+ xmax=12000,
ytick={0,5,...,30},
ymax=32,
scaled ticks=false,
@@ -687,12 +726,13 @@
\end{center}
\noindent
-To reacap, Python and Ruby needed approximately 30 seconds to match
-a string of 28 \texttt{a}s and the regular expression $a^{?\{n\}} \cdot a^{\{n\}}$.
-We need a third of this time to do the same with strings up to 12,000 \texttt{a}s.
-Similarly, Java needed 30 seconds to find out the regular expression
-$(a^*)^* \cdot b$ does not match the string of 28 \texttt{a}s. We can do
-the same in approximately 5 seconds for strings of 6000000 \texttt{a}s:
+To reacap, Python and Ruby needed approximately 30 seconds to match a
+string of 28 \texttt{a}s and the regular expression $a^{?\{n\}} \cdot
+a^{\{n\}}$. We need a third of this time to do the same with strings
+up to 11,000 \texttt{a}s. Similarly, Java and Python needed 30
+seconds to find out the regular expression $(a^*)^* \cdot b$ does not
+match the string of 28 \texttt{a}s. We can do the same in
+for strings composed of nearly 6,000,000 \texttt{a}s:
\begin{center}
@@ -700,20 +740,20 @@
\begin{axis}[
title={Graph: $(a^*)^* \cdot b$ and strings $\underbrace{a\ldots a}_{n}$},
xlabel={$n$},
- x label style={at={(1.09,0.0)}},
ylabel={time in secs},
enlargelimits=false,
- xmax=7700000,
+ ymax=35,
ytick={0,5,...,30},
- ymax=32,
+ axis lines=left,
%scaled ticks=false,
- axis lines=left,
+ x label style={at={(1.09,0.0)}},
+ %xmax=7700000,
width=9cm,
height=5cm,
- legend entries={Scala V2, Scala V3},
+ legend entries={Scala V3},
legend pos=outer north east,
legend cell align=left]
-\addplot[green,mark=square*,mark options={fill=white}] table {re2a.data};
+%\addplot[green,mark=square*,mark options={fill=white}] table {re2a.data};
\addplot[black,mark=square*,mark options={fill=white}] table {re3a.data};
\end{axis}
\end{tikzpicture}
@@ -721,12 +761,11 @@
\subsection*{Epilogue}
-(23/Aug/2016) I recently found another place where this algorithm can be
-sped (this idea is not integrated with what is coming next,
-but I present it nonetheless). The idea is to define \texttt{ders}
-not such that it iterates the derivative character-by-character, but
-in bigger chunks. The resulting code for \texttt{ders2} looks as
-follows:
+(23/Aug/2016) I recently found another place where this algorithm can
+be sped up (this idea is not integrated with what is coming next, but
+I present it nonetheless). The idea is to not define \texttt{ders}
+that it iterates the derivative character-by-character, but in bigger
+chunks. The resulting code for \texttt{ders2} looks as follows:
\lstinputlisting[numbers=none]{../progs/app52.scala}
@@ -756,7 +795,7 @@
ymax=33,
%scaled ticks=false,
axis lines=left,
- width=5.5cm,
+ width=5.3cm,
height=5cm,
legend entries={Scala V3, Scala V4},
legend style={at={(0.1,-0.2)},anchor=north}]
@@ -772,12 +811,12 @@
x label style={at={(1.09,0.0)}},
ylabel={time in secs},
enlargelimits=false,
- xmax=8100000,
+ xmax=8200000,
ytick={0,5,...,30},
ymax=33,
%scaled ticks=false,
axis lines=left,
- width=5.5cm,
+ width=5.3cm,
height=5cm,
legend entries={Scala V3, Scala V4},
legend style={at={(0.1,-0.2)},anchor=north}]
@@ -793,7 +832,7 @@
You might not like doing proofs. But they serve a very
important purpose in Computer Science: How can we be sure that
-our algorithm matches its specification. We can try to test
+our algorithm matches its specification? We can try to test
the algorithm, but that often overlooks corner cases and an
exhaustive testing is impossible (since there are infinitely
many inputs). Proofs allow us to ensure that an algorithm
@@ -814,7 +853,7 @@
\end{tabular}
\end{center}
-\noindent If you want to show a property $P(r)$ for all
+\noindent If you want to show a property $P(r)$ for \emph{all}
regular expressions $r$, then you have to follow essentially
the recipe:
@@ -866,8 +905,9 @@
\label{propalt}
\end{equation}
-\noindent The difference to the base cases is that in this
-case we can already assume we proved
+\noindent The difference to the base cases is that in the inductive
+cases we can already assume we proved $P$ for the components, that is
+we can assume.
\begin{center}
\begin{tabular}{l}
@@ -876,9 +916,9 @@
\end{tabular}
\end{center}
-\noindent These are the induction hypotheses. To check this
+\noindent These are called the induction hypotheses. To check this
case, we can start from $\textit{nullable}(r_1 + r_2)$, which by
-definition is
+definition of $\textit{nullable}$ is
\[
\textit{nullable}(r_1) \vee \textit{nullable}(r_2)
@@ -901,18 +941,18 @@
[] \in L(r_1)\cup L(r_2)
\]
-\noindent but this is by definition of $L$ exactly $[] \in
-L(r_1 + r_2)$, which we needed to establish according to
+\noindent but this is by definition of $L$ exactly $[] \in L(r_1 +
+r_2)$, which we needed to establish according to statement in
\eqref{propalt}. What we have shown is that starting from
$\textit{nullable}(r_1 + r_2)$ we have done equivalent transformations
-to end up with $[] \in L(r_1 + r_2)$. Consequently we have
-established that $P(r_1 + r_2)$ holds.
+to end up with $[] \in L(r_1 + r_2)$. Consequently we have established
+that $P(r_1 + r_2)$ holds.
In order to complete the proof we would now need to look
at the cases \mbox{$P(r_1\cdot r_2)$} and $P(r^*)$. Again I let you
check the details.
-You might have to do induction proofs over strings.
+You might also have to do induction proofs over strings.
That means you want to establish a property $P(s)$ for all
strings $s$. For this remember strings are lists of
characters. These lists can be either the empty list or a
@@ -948,8 +988,9 @@
L(\textit{der}\,c\,r) = \textit{Der}\,c\,(L(r))
\]
-\noindent holds (this would be of course a property that
-needs to be proved in a side-lemma by induction on $r$).
+\noindent holds (this would be of course another property that needs
+to be proved in a side-lemma by induction on $r$). This is a bit
+more challenging, but not impossible.
To sum up, using reasoning like the one shown above allows us
to show the correctness of our algorithm. To see this,
@@ -968,9 +1009,9 @@
\label{dersstep}
\end{equation}
-\noindent But we have shown above in \eqref{dersprop}, that
-the $\textit{Ders}$ can be replaced by $L(\textit{ders}\ldots)$. That means
-\eqref{dersstep} is equivalent to
+\noindent You agree? But we have shown above in \eqref{dersprop},
+that the $\textit{Ders}$ can be replaced by
+$L(\textit{ders}\ldots)$. That means \eqref{dersstep} is equivalent to
\begin{equation}
[] \in L(\textit{ders}\,s\,r)
@@ -999,11 +1040,13 @@
s\in L(r)
\]
-\noindent which is the property we set out to prove:
-our algorithm meets its specification. To have done
-so, requires a few induction proofs about strings and
-regular expressions. Following the recipes is already a big
-step in performing these proofs.
+\noindent which is the property we set out to prove: our algorithm
+meets its specification. To have done so, requires a few induction
+proofs about strings and regular expressions. Following the \emph{induction
+recipes} is already a big step in actually performing these proofs.
+If you do not believe it, proofs have helped me to make sure my code
+is correct and in several instances prevented me of letting slip
+embarassing mistakes into the `wild'.
\end{document}
Binary file handouts/ho03.pdf has changed
--- a/handouts/ho03.tex Tue Sep 26 14:07:29 2017 +0100
+++ b/handouts/ho03.tex Tue Sep 26 14:08:49 2017 +0100
@@ -5,39 +5,49 @@
\begin{document}
+\fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017}
-\section*{Handout 3 (Automata)}
+\section*{Handout 3 (Finite Automata)}
+
-Every formal language course I know of bombards you first with
-automata and then to a much, much smaller extend with regular
-expressions. As you can see, this course is turned upside
-down: regular expressions come first. The reason is that
-regular expressions are easier to reason about and the notion
-of derivatives, although already quite old, only became more
-widely known rather recently. Still let us in this lecture
-have a closer look at automata and their relation to regular
-expressions. This will help us with understanding why the
-regular expression matchers in Python, Ruby and Java are so slow
-with certain regular expressions. The central definition
-is:\medskip
+Every formal language and compiler course I know of bombards you first
+with automata and then to a much, much smaller extend with regular
+expressions. As you can see, this course is turned upside down:
+regular expressions come first. The reason is that regular expressions
+are easier to reason about and the notion of derivatives, although
+already quite old, only became more widely known rather
+recently. Still, let us in this lecture have a closer look at automata
+and their relation to regular expressions. This will help us with
+understanding why the regular expression matchers in Python, Ruby and
+Java are so slow with certain regular expressions. On the way we will
+also see what are the limitations of regular
+expressions. Unfortunately, they cannot be used for \emph{everything}.
+
+
+\subsection*{Deterministic Finite Automata}
+
+Lets start\ldots the central definition is:\medskip
\noindent
A \emph{deterministic finite automaton} (DFA), say $A$, is
-defined by a four-tuple written $A(Q, q_0, F, \delta)$ where
+given by a five-tuple written ${\cal A}(\varSigma, Qs, Q_0, F, \delta)$ where
\begin{itemize}
-\item $Q$ is a finite set of states,
-\item $q_0 \in Q$ is the start state,
-\item $F \subseteq Q$ are the accepting states, and
+\item $\varSigma$ is an alphabet,
+\item $Qs$ is a finite set of states,
+\item $Q_0 \in Qs$ is the start state,
+\item $F \subseteq Qs$ are the accepting states, and
\item $\delta$ is the transition function.
\end{itemize}
-\noindent The transition function determines how to
-``transition'' from one state to the next state with respect
-to a character. We have the assumption that these transition
-functions do not need to be defined everywhere: so it can be
-the case that given a character there is no next state, in
-which case we need to raise a kind of ``failure exception''. A
+\noindent I am sure you have seen this definition already
+before. The transition function determines how to ``transition'' from
+one state to the next state with respect to a character. We have the
+assumption that these transition functions do not need to be defined
+everywhere: so it can be the case that given a character there is no
+next state, in which case we need to raise a kind of ``failure
+exception''. That means actually we have \emph{partial} functions as
+transitions---see the Scala implementation for DFAs later on. A
typical example of a DFA is
\begin{center}
@@ -45,114 +55,324 @@
every state/.style={minimum size=0pt,
inner sep=2pt,draw=blue!50,very thick,
fill=blue!20},scale=2]
-\node[state,initial] (q_0) {$q_0$};
-\node[state] (q_1) [right=of q_0] {$q_1$};
-\node[state] (q_2) [below right=of q_0] {$q_2$};
-\node[state] (q_3) [right=of q_2] {$q_3$};
-\node[state, accepting] (q_4) [right=of q_1] {$q_4$};
-\path[->] (q_0) edge node [above] {$a$} (q_1);
-\path[->] (q_1) edge node [above] {$a$} (q_4);
-\path[->] (q_4) edge [loop right] node {$a, b$} ();
-\path[->] (q_3) edge node [right] {$a$} (q_4);
-\path[->] (q_2) edge node [above] {$a$} (q_3);
-\path[->] (q_1) edge node [right] {$b$} (q_2);
-\path[->] (q_0) edge node [above] {$b$} (q_2);
-\path[->] (q_2) edge [loop left] node {$b$} ();
-\path[->] (q_3) edge [bend left=95, looseness=1.3] node [below] {$b$} (q_0);
+\node[state,initial] (Q_0) {$Q_0$};
+\node[state] (Q_1) [right=of Q_0] {$Q_1$};
+\node[state] (Q_2) [below right=of Q_0] {$Q_2$};
+\node[state] (Q_3) [right=of Q_2] {$Q_3$};
+\node[state, accepting] (Q_4) [right=of Q_1] {$Q_4$};
+\path[->] (Q_0) edge node [above] {$a$} (Q_1);
+\path[->] (Q_1) edge node [above] {$a$} (Q_4);
+\path[->] (Q_4) edge [loop right] node {$a, b$} ();
+\path[->] (Q_3) edge node [right] {$a$} (Q_4);
+\path[->] (Q_2) edge node [above] {$a$} (Q_3);
+\path[->] (Q_1) edge node [right] {$b$} (Q_2);
+\path[->] (Q_0) edge node [above] {$b$} (Q_2);
+\path[->] (Q_2) edge [loop left] node {$b$} ();
+\path[->] (Q_3) edge [bend left=95, looseness=1.3] node [below] {$b$} (Q_0);
\end{tikzpicture}
\end{center}
-\noindent In this graphical notation, the accepting state
-$q_4$ is indicated with double circles. Note that there can be
-more than one accepting state. It is also possible that a DFA
-has no accepting states at all, or that the starting state is
-also an accepting state. In the case above the transition
-function is defined everywhere and can be given as a table as
-follows:
+\noindent In this graphical notation, the accepting state $Q_4$ is
+indicated with double circles. Note that there can be more than one
+accepting state. It is also possible that a DFA has no accepting
+state at all, or that the starting state is also an accepting
+state. In the case above the transition function is defined everywhere
+and can also be given as a table as follows:
\[
\begin{array}{lcl}
-(q_0, a) &\rightarrow& q_1\\
-(q_0, b) &\rightarrow& q_2\\
-(q_1, a) &\rightarrow& q_4\\
-(q_1, b) &\rightarrow& q_2\\
-(q_2, a) &\rightarrow& q_3\\
-(q_2, b) &\rightarrow& q_2\\
-(q_3, a) &\rightarrow& q_4\\
-(q_3, b) &\rightarrow& q_0\\
-(q_4, a) &\rightarrow& q_4\\
-(q_4, b) &\rightarrow& q_4\\
+(Q_0, a) &\rightarrow& Q_1\\
+(Q_0, b) &\rightarrow& Q_2\\
+(Q_1, a) &\rightarrow& Q_4\\
+(Q_1, b) &\rightarrow& Q_2\\
+(Q_2, a) &\rightarrow& Q_3\\
+(Q_2, b) &\rightarrow& Q_2\\
+(Q_3, a) &\rightarrow& Q_4\\
+(Q_3, b) &\rightarrow& Q_0\\
+(Q_4, a) &\rightarrow& Q_4\\
+(Q_4, b) &\rightarrow& Q_4\\
\end{array}
\]
+\noindent
+Please check that this table represents the same transition function
+as the graph above.
+
We need to define the notion of what language is accepted by
an automaton. For this we lift the transition function
$\delta$ from characters to strings as follows:
\[
\begin{array}{lcl}
-\hat{\delta}(q, []) & \dn & q\\
-\hat{\delta}(q, c\!::\!s) & \dn & \hat{\delta}(\delta(q, c), s)\\
+\widehat{\delta}(q, []) & \dn & q\\
+\widehat{\delta}(q, c\!::\!s) & \dn & \widehat{\delta}(\delta(q, c), s)\\
\end{array}
\]
\noindent This lifted transition function is often called
-``delta-hat''. Given a string, we start in the starting state
-and take the first character of the string, follow to the next
-sate, then take the second character and so on. Once the
-string is exhausted and we end up in an accepting state, then
-this string is accepted by the automaton. Otherwise it is not
-accepted. So $s$ is in the \emph{language accepted by the
-automaton} $A(Q, q_0, F, \delta)$ iff
+\emph{delta-hat}. Given a string, we start in the starting state and
+take the first character of the string, follow to the next state, then
+take the second character and so on. Once the string is exhausted and
+we end up in an accepting state, then this string is accepted by the
+automaton. Otherwise it is not accepted. This also means that if along
+the way we hit the case where the transition function $\delta$ is not
+defined, we need to raise an error. In our implementation we will deal
+with this case elegantly by using Scala's \texttt{Try}. Summing up: a
+string $s$ is in the \emph{language accepted by the automaton} ${\cal
+ A}(\varSigma, Q, Q_0, F, \delta)$ iff
\[
-\hat{\delta}(q_0, s) \in F
+\widehat{\delta}(Q_0, s) \in F
\]
-\noindent I let you think about a definition that describes
-the set of strings accepted by an automaton.
-
+\noindent I let you think about a definition that describes the set of
+all strings accepted by a deterministic finite automaton.
+
+\begin{figure}[p]
+\small
+\lstinputlisting[numbers=left]{../progs/display/dfa.scala}
+\caption{A Scala implementation of DFAs using partial functions.
+ Note some subtleties: \texttt{deltas} implements the delta-hat
+ construction by lifting the (partial) transition function to lists
+ of characters. Since \texttt{delta} is given as a partial function,
+ it can obviously go ``wrong'' in which case the \texttt{Try} in
+ \texttt{accepts} catches the error and returns \texttt{false}---that
+ means the string is not accepted. The example \texttt{delta} in
+ Line 28--38 implements the DFA example shown earlier in the
+ handout.\label{dfa}}
+\end{figure}
+
+My take on an implementation of DFAs in Scala is given in
+Figure~\ref{dfa}. As you can see, there are many features of the
+mathematical definition that are quite closely reflected in the
+code. In the DFA-class, there is a starting state, called
+\texttt{start}, with the polymorphic type \texttt{A}. There is a
+partial function \texttt{delta} for specifying the transitions---these
+partial functions take a state (of polymorphic type \texttt{A}) and an
+input (of polymorphic type \texttt{C}) and produce a new state (of
+type \texttt{A}). For the moment it is OK to assume that \texttt{A} is
+some arbitrary type for states and the input is just characters. (The
+reason for not having concrete types, but polymorphic types for the
+states and the input of DFAs will become clearer later on.)
+
+The DFA-class has also an argument for specifying final states. In the
+implementation it is not a set of states, as in the mathematical
+definition, but a function from states to booleans (this function is
+supposed to return true whenever a state is final; false
+otherwise). While this boolean function is different from the sets of
+states, Scala allows to use sets for such functions (see Line 40 where
+the DFA is initialised). Again it will become clear later on why I use
+functions for final states, rather than sets.
-While with DFAs it will always be clear that given a character
-what the next state is (potentially none), it will be useful
-to relax this restriction. That means we have several
-potential successor states. We even allow ``silent
-transitions'', also called epsilon-transitions. They allow us
-to go from one state to the next without having a character
-consumed. We label such silent transition with the letter
-$\epsilon$. The resulting construction is called a
-\emph{non-deterministic finite automaton} (NFA) given also as
-a four-tuple $A(Q, q_0, F, \rho)$ where
+The most important point in the implementation is that I use Scala's
+partial functions for representing the transitions; alternatives would
+have been \texttt{Maps} or even \texttt{Lists}. One of the main
+advantages of using partial functions is that transitions can be quite
+nicely defined by a series of \texttt{case} statements (see Lines 28
+-- 38 for an example). If you need to represent an automaton with a
+sink state (catch-all-state), you can use Scala's pattern matching and
+write something like
+
+{\small\begin{lstlisting}[language=Scala]
+abstract class State
+...
+case object Sink extends State
+
+val delta : (State, Char) :=> State =
+ { case (S0, 'a') => S1
+ case (S1, 'a') => S2
+ case _ => Sink
+ }
+\end{lstlisting}}
+
+\noindent I let you think what the corresponding DFA looks like in the
+graph notation. Also, I suggest you to tinker with the Scala code in
+order to define the DFA that does not accept any string at all.
+
+Finally, I let you ponder whether this is a good implementation of
+DFAs or not. In doing so I hope you notice that the $\varSigma$ and
+$Qs$ components (the alphabet and the set of finite states,
+respectively) are missing from the class definition. This means that
+the implementation allows you to do some fishy things you are not
+meant to do with DFAs. Which fishy things could that be?
+
+
+
+\subsection*{Non-Deterministic Finite Automata}
+
+Remember we want to find out what the relation is between regular
+expressions and automata. To do this with DFAs is a bit unwieldy.
+While with DFAs it is always clear that given a state and a character
+what the next state is (potentially none), it will be convenient to
+relax this restriction. That means we allow states to have several
+potential successor states. We even allow more than one starting
+state. The resulting construction is called a \emph{Non-Deterministic
+ Finite Automaton} (NFA) given also as a five-tuple ${\cal
+ A}(\varSigma, Qs, Q_{0s}, F, \rho)$ where
\begin{itemize}
-\item $Q$ is a finite set of states
-\item $q_0$ is a start state
-\item $F$ are some accepting states with $F \subseteq Q$, and
+\item $\varSigma$ is an alphabet,
+\item $Qs$ is a finite set of states
+\item $Q_{0s}$ is a set of start states ($Q_{0s} \subseteq Qs$)
+\item $F$ are some accepting states with $F \subseteq Qs$, and
\item $\rho$ is a transition relation.
\end{itemize}
\noindent
-Two typical examples of NFAs are
+A typical example of a NFA is
+
+% A NFA for (ab* + b)*a
+\begin{center}
+\begin{tikzpicture}[>=stealth',very thick, auto,
+ every state/.style={minimum size=0pt,inner sep=3pt,
+ draw=blue!50,very thick,fill=blue!20},scale=2]
+\node[state,initial] (Q_0) {$Q_0$};
+\node[state] (Q_1) [right=of Q_0] {$Q_1$};
+\node[state, accepting] (Q_2) [right=of Q_1] {$Q_2$};
+\path[->] (Q_0) edge [loop above] node {$b$} ();
+\path[<-] (Q_0) edge node [below] {$b$} (Q_1);
+\path[->] (Q_0) edge [bend left] node [above] {$a$} (Q_1);
+\path[->] (Q_0) edge [bend right] node [below] {$a$} (Q_2);
+\path[->] (Q_1) edge [loop above] node {$a,b$} ();
+\path[->] (Q_1) edge node [above] {$a$} (Q_2);
+\end{tikzpicture}
+\end{center}
+
+\noindent
+This NFA happens to have only one starting state, but in general there
+could be more than one. Notice that in state $Q_0$ we might go to
+state $Q_1$ \emph{or} to state $Q_2$ when receiving an $a$. Similarly
+in state $Q_1$ and receiving an $a$, we can stay in state $Q_1$
+\emph{or} go to $Q_2$. This kind of choice is not allowed with
+DFAs. The downside of this choice in NFAs is that when it comes to
+deciding whether a string is accepted by a NFA we potentially have to
+explore all possibilities. I let you think which strings the above NFA
+accepts.
+
+
+There are a number of additional points you should note about
+NFAs. Every DFA is a NFA, but not vice versa. The $\rho$ in NFAs is a
+transition \emph{relation} (DFAs have a transition function). The
+difference between a function and a relation is that a function has
+always a single output, while a relation gives, roughly speaking,
+several outputs. Look again at the NFA above: if you are currently in
+the state $Q_1$ and you read a character $b$, then you can transition
+to either $Q_0$ \emph{or} $Q_2$. Which route, or output, you take is
+not determined. This non-determinism can be represented by a
+relation.
+
+My implementation of NFAs in Scala is shown in Figure~\ref{nfa}.
+Perhaps interestingly, I do not actually use relations for my NFAs,
+but use transition functions that return sets of states. DFAs have
+partial transition functions that return a single state; my NFAs
+return a set of states. I let you think about this representation for
+NFA-transitions and how it corresponds to the relations used in the
+mathematical definition of NFAs. An example of a transition function
+in Scala for the NFA shown above is
+
+{\small\begin{lstlisting}[language=Scala]
+val nfa_delta : (State, Char) :=> Set[State] =
+ { case (Q0, 'a') => Set(Q1, Q2)
+ case (Q0, 'b') => Set(Q0)
+ case (Q1, 'a') => Set(Q1, Q2)
+ case (Q1, 'b') => Set(Q0, Q1) }
+\end{lstlisting}}
+
+Like in the mathematical definition, \texttt{starts} is in
+NFAs a set of states; \texttt{fins} is again a function from states to
+booleans. The \texttt{next} function calculates the set of next states
+reachable from a single state \texttt{q} by a character~\texttt{c}. In
+case there is no such state---the partial transition function is
+undefined---the empty set is returned (see function
+\texttt{applyOrElse} in Lines 9 and 10). The function \texttt{nexts}
+just lifts this function to sets of states.
+
+\begin{figure}[p]
+\small
+\lstinputlisting[numbers=left]{../progs/display/nfa.scala}
+\caption{A Scala implementation of NFAs using partial functions.
+ Notice that the function \texttt{accepts} implements the
+ acceptance of a string in a breath-first search fashion. This can be a costly
+ way of deciding whether a string is accepted or not in applications that need to handle
+ large NFAs and large inputs.\label{nfa}}
+\end{figure}
+
+Look very careful at the \texttt{accepts} and \texttt{deltas}
+functions in NFAs and remember that when accepting a string by a NFA
+we might have to explore all possible transitions (recall which state
+to go to is not unique anymore with NFAs\ldots{}we need to explore
+potentially all next states). The implementation achieves this
+exploration through a \emph{breadth-first search}. This is fine for
+small NFAs, but can lead to real memory problems when the NFAs are
+bigger and larger strings need to be processed. As result, some
+regular expression matching engines resort to a \emph{depth-first
+ search} with \emph{backtracking} in unsuccessful cases. In our
+implementation we can implement a depth-first version of
+\texttt{accepts} using Scala's \texttt{exists}-function as follows:
+
+
+{\small\begin{lstlisting}[language=Scala]
+def search(q: A, s: List[C]) : Boolean = s match {
+ case Nil => fins(q)
+ case c::cs => next(q, c).exists(search(_, cs))
+}
+
+def accepts2(s: List[C]) : Boolean =
+ starts.exists(search(_, s))
+\end{lstlisting}}
+
+\noindent
+This depth-first way of exploration seems to work quite efficiently in
+many examples and is much less of a strain on memory. The problem is
+that the backtracking can get ``catastrophic'' in some
+examples---remember the catastrophic backtracking from earlier
+lectures. This depth-first search with backtracking is the reason for
+the abysmal performance of some regular expression matchings in Java,
+Ruby and Python. I like to show you this in the next two sections.
+
+
+\subsection*{Epsilon NFAs}
+
+In order to get an idea what calculations are performed by Java \&
+friends, we need a method for transforming a regular expression into
+an automaton. This automaton should accept exactly those strings that
+are accepted by the regular expression. The simplest and most
+well-known method for this is called \emph{Thompson Construction},
+after the Turing Award winner Ken Thompson. This method is by
+recursion over regular expressions and depends on the non-determinism
+in NFAs described in the previous section. You will see shortly why
+this construction works well with NFAs, but is not so straightforward
+with DFAs.
+
+Unfortunately we are still one step away from our intended target
+though---because this construction uses a version of NFAs that allows
+``silent transitions''. The idea behind silent transitions is that
+they allow us to go from one state to the next without having to
+consume a character. We label such silent transition with the letter
+$\epsilon$ and call the automata $\epsilon$NFAs. Two typical examples
+of $\epsilon$NFAs are:
+
+
\begin{center}
\begin{tabular}[t]{c@{\hspace{9mm}}c}
-\begin{tikzpicture}[scale=0.7,>=stealth',very thick,
- every state/.style={minimum size=0pt,draw=blue!50,very thick,fill=blue!20},]
-\node[state,initial] (q_0) {$q_0$};
-\node[state] (q_1) [above=of q_0] {$q_1$};
-\node[state, accepting] (q_2) [below=of q_0] {$q_2$};
-\path[->] (q_0) edge node [left] {$\epsilon$} (q_1);
-\path[->] (q_0) edge node [left] {$\epsilon$} (q_2);
-\path[->] (q_0) edge [loop right] node {$a$} ();
-\path[->] (q_1) edge [loop above] node {$a$} ();
-\path[->] (q_2) edge [loop below] node {$b$} ();
+\begin{tikzpicture}[>=stealth',very thick,
+ every state/.style={minimum size=0pt,draw=blue!50,very thick,fill=blue!20},]
+\node[state,initial] (Q_0) {$Q_0$};
+\node[state] (Q_1) [above=of Q_0] {$Q_1$};
+\node[state, accepting] (Q_2) [below=of Q_0] {$Q_2$};
+\path[->] (Q_0) edge node [left] {$\epsilon$} (Q_1);
+\path[->] (Q_0) edge node [left] {$\epsilon$} (Q_2);
+\path[->] (Q_0) edge [loop right] node {$a$} ();
+\path[->] (Q_1) edge [loop right] node {$a$} ();
+\path[->] (Q_2) edge [loop right] node {$b$} ();
\end{tikzpicture} &
\raisebox{20mm}{
-\begin{tikzpicture}[scale=0.7,>=stealth',very thick,
- every state/.style={minimum size=0pt,draw=blue!50,very thick,fill=blue!20},]
-\node[state,initial] (r_1) {$r_1$};
-\node[state] (r_2) [above=of r_1] {$r_2$};
-\node[state, accepting] (r_3) [right=of r_1] {$r_3$};
+\begin{tikzpicture}[>=stealth',very thick,
+ every state/.style={minimum size=0pt,draw=blue!50,very thick,fill=blue!20},]
+\node[state,initial] (r_1) {$R_1$};
+\node[state] (r_2) [above=of r_1] {$R_2$};
+\node[state, accepting] (r_3) [right=of r_1] {$R_3$};
\path[->] (r_1) edge node [below] {$b$} (r_3);
\path[->] (r_2) edge [bend left] node [above] {$a$} (r_3);
\path[->] (r_1) edge [bend left] node [left] {$\epsilon$} (r_2);
@@ -161,121 +381,309 @@
\end{tabular}
\end{center}
-\noindent There are, however, a number of points you should
-note. Every DFA is a NFA, but not vice versa. The $\rho$ in
-NFAs is a transition \emph{relation} (DFAs have a transition
-function). The difference between a function and a relation is
-that a function has always a single output, while a relation
-gives, roughly speaking, several outputs. Look at the NFA on
-the right-hand side above: if you are currently in the state
-$r_2$ and you read a character $a$, then you can transition to
-either $r_1$ \emph{or} $r_3$. Which route you take is not
-determined. This means if we need to decide whether a string
-is accepted by a NFA, we might have to explore all
-possibilities. Also there is the special silent transition in
-NFAs. As mentioned already this transition means you do not
-have to ``consume'' any part of the input string, but
-``silently'' change to a different state. In the left picture,
-for example, if you are in the starting state, you can
-silently move either to $q_1$ or $q_2$. This silent
-transition is also often called \emph{$\epsilon$-transition}.
+\noindent
+Consider the $\epsilon$NFA on the left-hand side: the
+$\epsilon$-transitions mean you do not have to ``consume'' any part of
+the input string, but ``silently'' change to a different state. In
+this example, if you are in the starting state $Q_0$, you can silently
+move either to $Q_1$ or $Q_2$. You can see that once you are in $Q_1$,
+respectively $Q_2$, you cannot ``go back'' to the other states. So it
+seems allowing $\epsilon$-transitions is a rather substantial
+extension to NFAs. On first appearances, $\epsilon$-transitions might
+even look rather strange, or even silly. To start with, silent
+transitions make the decision whether a string is accepted by an
+automaton even harder: with $\epsilon$NFAs we have to look whether we
+can do first some $\epsilon$-transitions and then do a
+``proper''-transition; and after any ``proper''-transition we again
+have to check whether we can do again some silent transitions. Even
+worse, if there is a silent transition pointing back to the same
+state, then we have to be careful our decision procedure for strings
+does not loop (remember the depth-first search for exploring all
+states).
+
+The obvious question is: Do we get anything in return for this hassle
+with silent transitions? Well, we still have to work for it\ldots
+unfortunately. If we were to follow the many textbooks on the
+subject, we would now start with defining what $\epsilon$NFAs
+are---that would require extending the transition relation of
+NFAs. Next, we would show that the $\epsilon$NFAs are equivalent to
+NFAs and so on. Once we have done all this on paper, we would need to
+implement $\epsilon$NFAs. Lets try to take a shortcut instead. We are
+not really interested in $\epsilon$NFAs; they are only a convenient
+tool for translating regular expressions into automata. So we are not
+going to implementing them explicitly, but translate them immediately
+into NFAs (in a sense $\epsilon$NFAs are just a convenient API for
+lazy people ;o). How does the translation work? Well we have to find
+all transitions of the form
+
+\[
+q\stackrel{\epsilon}{\longrightarrow}\ldots\stackrel{\epsilon}{\longrightarrow}
+\;\stackrel{a}{\longrightarrow}\;
+\stackrel{\epsilon}{\longrightarrow}\ldots\stackrel{\epsilon}{\longrightarrow} q'
+\]
+
+\noindent where somewhere in the ``middle'' is an $a$-transition. We
+replace them with $q \stackrel{a}{\longrightarrow} q'$. Doing this to
+the $\epsilon$NFA on the right-hand side above gives the NFA
+
+\begin{center}
+\begin{tikzpicture}[>=stealth',very thick,
+ every state/.style={minimum size=0pt,draw=blue!50,very thick,fill=blue!20},]
+\node[state,initial] (r_1) {$R_1$};
+\node[state] (r_2) [above=of r_1] {$R_2$};
+\node[state, accepting] (r_3) [right=of r_1] {$R_3$};
+\path[->] (r_1) edge node [above] {$b$} (r_3);
+\path[->] (r_2) edge [bend left] node [above] {$a$} (r_3);
+\path[->] (r_1) edge [bend left] node [left] {$a$} (r_2);
+\path[->] (r_2) edge [bend left] node [right] {$a$} (r_1);
+\path[->] (r_1) edge [loop below] node {$a$} ();
+\path[->] (r_1) edge [bend right] node [below] {$a$} (r_3);
+\end{tikzpicture}
+\end{center}
+
+\noindent where the single $\epsilon$-transition is replaced by
+three additional $a$-transitions. Please do the calculations yourself
+and verify that I did not forget any transition.
+
+So in what follows, whenever we are given an $\epsilon$NFA we will
+replace it by an equivalent NFA. The Scala code for this translation
+is given in Figure~\ref{enfa}. The main workhorse in this code is a
+function that calculates a fixpoint of function (Lines 5--10). This
+function is used for ``discovering'' which states are reachable by
+$\epsilon$-transitions. Once no new state is discovered, a fixpoint is
+reached. This is used for example when calculating the starting states
+of an equivalent NFA (see Line 36): we start with all starting states
+of the $\epsilon$NFA and then look for all additional states that can
+be reached by $\epsilon$-transitions. We keep on doing this until no
+new state can be reached. This is what the $\epsilon$-closure, named
+in the code \texttt{ecl}, calculates. Similarly, an accepting state of
+the NFA is when we can reach an accepting state of the $\epsilon$NFA
+by $\epsilon$-transitions.
-\subsubsection*{Thompson Construction}
+\begin{figure}[p]
+\small
+\lstinputlisting[numbers=left]{../progs/display/enfa.scala}
+
+\caption{A Scala function that translates $\epsilon$NFA into NFAs. The
+ transition function of $\epsilon$NFA takes as input an \texttt{Option[C]}.
+ \texttt{None} stands for an $\epsilon$-transition; \texttt{Some(c)}
+ for a ``proper'' transition consuming a character. The functions in
+ Lines 18--26 calculate
+ all states reachable by one or more $\epsilon$-transition for a given
+ set of states. The NFA is constructed in Lines 36--38.
+ Note the interesting commands in Lines 5 and 6: their purpose is
+ to ensure that \texttt{fixpT} is the tail-recursive version of
+ the fixpoint construction; otherwise we would quickly get a
+ stack-overflow exception, even on small examples, due to limitations
+ of the JVM.
+ \label{enfa}}
+\end{figure}
+
+Also look carefully how the transitions of $\epsilon$NFAs are
+implemented. The additional possibility of performing silent
+transitions is encoded by using \texttt{Option[C]} as the type for the
+``input''. The \texttt{Some}s stand for ``proper'' transitions where
+a character is consumed; \texttt{None} stands for
+$\epsilon$-transitions. The transition functions for the two
+$\epsilon$NFAs from the beginning of this section can be defined as
-The reason for introducing NFAs is that there is a relatively
-simple (recursive) translation of regular expressions into
-NFAs. Consider the simple regular expressions $\ZERO$,
-$\ONE$ and $c$. They can be translated as follows:
+{\small\begin{lstlisting}[language=Scala]
+val enfa_trans1 : (State, Option[Char]) :=> Set[State] =
+ { case (Q0, Some('a')) => Set(Q0)
+ case (Q0, None) => Set(Q1, Q2)
+ case (Q1, Some('a')) => Set(Q1)
+ case (Q2, Some('b')) => Set(Q2) }
+
+val enfa_trans2 : (State, Option[Char]) :=> Set[State] =
+ { case (R1, Some('b')) => Set(R3)
+ case (R1, None) => Set(R2)
+ case (R2, Some('a')) => Set(R1, R3) }
+\end{lstlisting}}
-\begin{center}
+\noindent
+I hope you agree now with my earlier statement that the $\epsilon$NFAs
+are just an API for NFAs.
+
+\subsection*{Thompson Construction}
+
+Having the translation of $\epsilon$NFAs to NFAs in place, we can
+finally return to the problem of translating regular expressions into
+equivalent NFAs. Recall that by equivalent we mean that the NFAs
+recognise the same language. Consider the simple regular expressions
+$\ZERO$, $\ONE$ and $c$. They can be translated into equivalent NFAs
+as follows:
+
+\begin{equation}\mbox{
\begin{tabular}[t]{l@{\hspace{10mm}}l}
\raisebox{1mm}{$\ZERO$} &
\begin{tikzpicture}[scale=0.7,>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
-\node[state, initial] (q_0) {$\mbox{}$};
+\node[state, initial] (Q_0) {$\mbox{}$};
\end{tikzpicture}\\\\
\raisebox{1mm}{$\ONE$} &
\begin{tikzpicture}[scale=0.7,>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
-\node[state, initial, accepting] (q_0) {$\mbox{}$};
+\node[state, initial, accepting] (Q_0) {$\mbox{}$};
\end{tikzpicture}\\\\
-\raisebox{2mm}{$c$} &
+\raisebox{3mm}{$c$} &
\begin{tikzpicture}[scale=0.7,>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
-\node[state, initial] (q_0) {$\mbox{}$};
-\node[state, accepting] (q_1) [right=of q_0] {$\mbox{}$};
-\path[->] (q_0) edge node [below] {$c$} (q_1);
-\end{tikzpicture}\\\\
-\end{tabular}
-\end{center}
+\node[state, initial] (Q_0) {$\mbox{}$};
+\node[state, accepting] (Q_1) [right=of Q_0] {$\mbox{}$};
+\path[->] (Q_0) edge node [below] {$c$} (Q_1);
+\end{tikzpicture}\\
+\end{tabular}}\label{simplecases}
+\end{equation}
+
+\noindent
+I let you think whether the NFAs can match exactly those strings the
+regular expressions can match. To do this translation in code we need
+a way to construct states ``programatically''...and as an additional
+constraint Scala needs to recognise that these states are being distinct.
+For this I implemented in Figure~\ref{thompson1} a class
+\texttt{TState} that includes a counter and a companion object that
+increases this counter whenever a new state is created.\footnote{You might
+ have to read up what \emph{companion objects} do in Scala.}
-\noindent The case for the sequence regular expression $r_1
-\cdot r_2$ is as follows: We are given by recursion two
-automata representing $r_1$ and $r_2$ respectively.
+\begin{figure}[p]
+\small
+\lstinputlisting[numbers=left]{../progs/display/thompson1.scala}
+\caption{The first part of the Thompson Construction. Lines 7--16
+ implement a way of how to create new states that are all
+ distinct by virtue of a counter. This counter is
+ increased in the companion object of \texttt{TState}
+ whenever a new state is created. The code in Lines 24--40
+ constructs NFAs for the simple regular expressions $\ZERO$, $\ONE$ and $c$.
+ Compare this code with the pictures given in \eqref{simplecases} on
+ Page~\pageref{simplecases}.
+ \label{thompson1}}
+\end{figure}
+
+\begin{figure}[p]
+\small
+\lstinputlisting[numbers=left]{../progs/display/thompson2.scala}
+\caption{The second part of the Thompson Construction implementing
+ the composition of NFAs according to $\cdot$, $+$ and ${}^*$.
+ The implicit class about rich partial functions
+ implements the infix operation \texttt{+++} which
+ combines an $\epsilon$NFA transition with a NFA transition
+ (both are given as partial functions---but with different type!).\label{thompson2}}
+\end{figure}
+
+The case for the sequence regular expression $r_1 \cdot r_2$ is a bit more
+complicated: Say, we are given by recursion two NFAs representing the regular
+expressions $r_1$ and $r_2$ respectively.
\begin{center}
\begin{tikzpicture}[node distance=3mm,
- >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
-\node[state, initial] (q_0) {$\mbox{}$};
-\node (r_1) [right=of q_0] {$\ldots$};
-\node[state, accepting] (t_1) [right=of r_1] {$\mbox{}$};
-\node[state, accepting] (t_2) [above=of t_1] {$\mbox{}$};
-\node[state, accepting] (t_3) [below=of t_1] {$\mbox{}$};
-\node[state, initial] (a_0) [right=2.5cm of t_1] {$\mbox{}$};
-\node (b_1) [right=of a_0] {$\ldots$};
+ >=stealth',very thick,
+ every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
+\node[state, initial] (Q_0) {$\mbox{}$};
+\node[state, initial] (Q_01) [below=1mm of Q_0] {$\mbox{}$};
+\node[state, initial] (Q_02) [above=1mm of Q_0] {$\mbox{}$};
+\node (R_1) [right=of Q_0] {$\ldots$};
+\node[state, accepting] (T_1) [right=of R_1] {$\mbox{}$};
+\node[state, accepting] (T_2) [above=of T_1] {$\mbox{}$};
+\node[state, accepting] (T_3) [below=of T_1] {$\mbox{}$};
+
+\node (A_0) [right=2.5cm of T_1] {$\mbox{}$};
+\node[state, initial] (A_01) [above=1mm of A_0] {$\mbox{}$};
+\node[state, initial] (A_02) [below=1mm of A_0] {$\mbox{}$};
+
+\node (b_1) [right=of A_0] {$\ldots$};
\node[state, accepting] (c_1) [right=of b_1] {$\mbox{}$};
\node[state, accepting] (c_2) [above=of c_1] {$\mbox{}$};
\node[state, accepting] (c_3) [below=of c_1] {$\mbox{}$};
\begin{pgfonlayer}{background}
-\node (1) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (q_0) (r_1) (t_1) (t_2) (t_3)] {};
-\node (2) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (a_0) (b_1) (c_1) (c_2) (c_3)] {};
+ \node (1) [rounded corners, inner sep=1mm, thick,
+ draw=black!60, fill=black!20, fit= (Q_0) (R_1) (T_1) (T_2) (T_3)] {};
+ \node (2) [rounded corners, inner sep=1mm, thick,
+ draw=black!60, fill=black!20, fit= (A_0) (b_1) (c_1) (c_2) (c_3)] {};
\node [yshift=2mm] at (1.north) {$r_1$};
\node [yshift=2mm] at (2.north) {$r_2$};
\end{pgfonlayer}
\end{tikzpicture}
\end{center}
-\noindent The first automaton has some accepting states. We
-obtain an automaton for $r_1\cdot r_2$ by connecting these
-accepting states with $\epsilon$-transitions to the starting
-state of the second automaton. By doing so we make them
-non-accepting like so:
+\noindent The first NFA has some accepting states and the second some
+starting states. We obtain an $\epsilon$NFA for $r_1\cdot r_2$ by
+connecting the accepting states of the first NFA with
+$\epsilon$-transitions to the starting states of the second
+automaton. By doing so we make the accepting states of the first NFA
+to be non-accepting like so:
\begin{center}
\begin{tikzpicture}[node distance=3mm,
- >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
-\node[state, initial] (q_0) {$\mbox{}$};
-\node (r_1) [right=of q_0] {$\ldots$};
+ >=stealth',very thick,
+ every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
+\node[state, initial] (Q_0) {$\mbox{}$};
+\node[state, initial] (Q_01) [below=1mm of Q_0] {$\mbox{}$};
+\node[state, initial] (Q_02) [above=1mm of Q_0] {$\mbox{}$};
+\node (r_1) [right=of Q_0] {$\ldots$};
\node[state] (t_1) [right=of r_1] {$\mbox{}$};
\node[state] (t_2) [above=of t_1] {$\mbox{}$};
\node[state] (t_3) [below=of t_1] {$\mbox{}$};
-\node[state] (a_0) [right=2.5cm of t_1] {$\mbox{}$};
-\node (b_1) [right=of a_0] {$\ldots$};
+
+\node (A_0) [right=2.5cm of t_1] {$\mbox{}$};
+\node[state] (A_01) [above=1mm of A_0] {$\mbox{}$};
+\node[state] (A_02) [below=1mm of A_0] {$\mbox{}$};
+
+\node (b_1) [right=of A_0] {$\ldots$};
\node[state, accepting] (c_1) [right=of b_1] {$\mbox{}$};
\node[state, accepting] (c_2) [above=of c_1] {$\mbox{}$};
\node[state, accepting] (c_3) [below=of c_1] {$\mbox{}$};
-\path[->] (t_1) edge node [above, pos=0.3] {$\epsilon$} (a_0);
-\path[->] (t_2) edge node [above] {$\epsilon$} (a_0);
-\path[->] (t_3) edge node [below] {$\epsilon$} (a_0);
-
+\path[->] (t_1) edge (A_01);
+\path[->] (t_2) edge node [above] {$\epsilon$s} (A_01);
+\path[->] (t_3) edge (A_01);
+\path[->] (t_1) edge (A_02);
+\path[->] (t_2) edge (A_02);
+\path[->] (t_3) edge node [below] {$\epsilon$s} (A_02);
+
\begin{pgfonlayer}{background}
-\node (3) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (q_0) (c_1) (c_2) (c_3)] {};
+ \node (3) [rounded corners, inner sep=1mm, thick,
+ draw=black!60, fill=black!20, fit= (Q_0) (c_1) (c_2) (c_3)] {};
\node [yshift=2mm] at (3.north) {$r_1\cdot r_2$};
\end{pgfonlayer}
\end{tikzpicture}
\end{center}
-\noindent The case for the choice regular expression $r_1 +
-r_2$ is slightly different: We are given by recursion two
-automata representing $r_1$ and $r_2$ respectively.
+\noindent The idea behind this construction is that the start of any
+string is first recognised by the first NFA, then we silently change
+to the second NFA; the ending of the string is recognised by the
+second NFA...just like matching of a string by the regular expression
+$r_1\cdot r_2$. The Scala code for this construction is given in
+Figure~\ref{thompson2} in Lines 16--23. The starting states of the
+$\epsilon$NFA are the starting states of the first NFA (corresponding
+to $r_1$); the accepting function is the accepting function of the
+second NFA (corresponding to $r_2$). The new transition function is
+all the ``old'' transitions plus the $\epsilon$-transitions connecting
+the accepting states of the first NFA to the starting states of the
+first NFA (Lines 18 and 19). The $\epsilon$NFA is then immediately
+translated in a NFA.
+
+
+The case for the alternative regular expression $r_1 + r_2$ is
+slightly different: We are given by recursion two NFAs representing
+$r_1$ and $r_2$ respectively. Each NFA has some starting states and
+some accepting states. We obtain a NFA for the regular expression $r_1
++ r_2$ by composing the transition functions (this crucially depends
+on knowing that the states of each component NFA are distinct---recall
+we implemented for this to hold some bespoke code for states). We also
+need to combine the starting states and accepting functions
+appropriately.
\begin{center}
+\begin{tabular}[t]{ccc}
\begin{tikzpicture}[node distance=3mm,
- >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
+ >=stealth',very thick,
+ every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},
+ baseline=(current bounding box.center)]
\node at (0,0) (1) {$\mbox{}$};
-\node[state, initial] (2) [above right=16mm of 1] {$\mbox{}$};
-\node[state, initial] (3) [below right=16mm of 1] {$\mbox{}$};
+\node (2) [above=10mm of 1] {};
+\node[state, initial] (4) [above=1mm of 2] {$\mbox{}$};
+\node[state, initial] (5) [below=1mm of 2] {$\mbox{}$};
+\node[state, initial] (3) [below=10mm of 1] {$\mbox{}$};
-\node (a) [right=of 2] {$\ldots$};
-\node[state, accepting] (a1) [right=of a] {$\mbox{}$};
+\node (a) [right=of 2] {$\ldots\,$};
+\node (a1) [right=of a] {$$};
\node[state, accepting] (a2) [above=of a1] {$\mbox{}$};
\node[state, accepting] (a3) [below=of a1] {$\mbox{}$};
@@ -290,23 +698,21 @@
\node [yshift=3mm] at (2.north) {$r_2$};
\end{pgfonlayer}
\end{tikzpicture}
-\end{center}
-
-\noindent Each automaton has a single start state and
-potentially several accepting states. We obtain a NFA for the
-regular expression $r_1 + r_2$ by introducing a new starting
-state and connecting it with an $\epsilon$-transition to the
-two starting states above, like so
+&
+\mbox{}\qquad\tikz{\draw[>=stealth,line width=2mm,->] (0,0) -- (1, 0)}\quad\mbox{}
+&
+\begin{tikzpicture}[node distance=3mm,
+ >=stealth',very thick,
+ every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},
+ baseline=(current bounding box.center)]
+\node at (0,0) (1) {$\mbox{}$};
+\node (2) [above=10mm of 1] {$$};
+\node[state, initial] (4) [above=1mm of 2] {$\mbox{}$};
+\node[state, initial] (5) [below=1mm of 2] {$\mbox{}$};
+\node[state, initial] (3) [below=10mm of 1] {$\mbox{}$};
-\begin{center}
-\hspace{2cm}\begin{tikzpicture}[node distance=3mm,
- >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
-\node at (0,0) [state, initial] (1) {$\mbox{}$};
-\node[state] (2) [above right=16mm of 1] {$\mbox{}$};
-\node[state] (3) [below right=16mm of 1] {$\mbox{}$};
-
-\node (a) [right=of 2] {$\ldots$};
-\node[state, accepting] (a1) [right=of a] {$\mbox{}$};
+\node (a) [right=of 2] {$\ldots\,$};
+\node (a1) [right=of a] {$$};
\node[state, accepting] (a2) [above=of a1] {$\mbox{}$};
\node[state, accepting] (a3) [below=of a1] {$\mbox{}$};
@@ -315,24 +721,34 @@
\node[state, accepting] (b2) [above=of b1] {$\mbox{}$};
\node[state, accepting] (b3) [below=of b1] {$\mbox{}$};
-\path[->] (1) edge node [above] {$\epsilon$} (2);
-\path[->] (1) edge node [below] {$\epsilon$} (3);
+%\path[->] (1) edge node [above] {$\epsilon$} (2);
+%\path[->] (1) edge node [below] {$\epsilon$} (3);
\begin{pgfonlayer}{background}
\node (3) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (1) (a2) (a3) (b2) (b3)] {};
\node [yshift=3mm] at (3.north) {$r_1+ r_2$};
\end{pgfonlayer}
\end{tikzpicture}
+\end{tabular}
\end{center}
-\noindent
-Finally for the $*$-case we have an automaton for $r$
+\noindent The code for this construction is in Figure~\ref{thompson2}
+in Lines 25--33.
+
+Finally for the $*$-case we have a NFA for $r$ and connect its
+accepting states to a new starting state via
+$\epsilon$-transitions. This new starting state is also an accepting
+state, because $r^*$ can recognise the empty string.
\begin{center}
+\begin{tabular}[b]{@{}ccc@{}}
\begin{tikzpicture}[node distance=3mm,
- >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
-\node at (0,0) (1) {$\mbox{}$};
-\node[state, initial] (2) [right=16mm of 1] {$\mbox{}$};
+ >=stealth',very thick,
+ every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},
+ baseline=(current bounding box.north)]
+\node (2) {$\mbox{}$};
+\node[state, initial] (4) [above=1mm of 2] {$\mbox{}$};
+\node[state, initial] (5) [below=1mm of 2] {$\mbox{}$};
\node (a) [right=of 2] {$\ldots$};
\node[state, accepting] (a1) [right=of a] {$\mbox{}$};
\node[state, accepting] (a2) [above=of a1] {$\mbox{}$};
@@ -342,341 +758,360 @@
\node [yshift=3mm] at (1.north) {$r$};
\end{pgfonlayer}
\end{tikzpicture}
-\end{center}
-
-\noindent and connect its accepting states to a new starting
-state via $\epsilon$-transitions. This new starting state is
-also an accepting state, because $r^*$ can recognise the
-empty string. This gives the following automaton for $r^*$:
-
-\begin{center}
+&
+\raisebox{-16mm}{\;\tikz{\draw[>=stealth,line width=2mm,->] (0,0) -- (1, 0)}}
+&
\begin{tikzpicture}[node distance=3mm,
- >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
+ >=stealth',very thick,
+ every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},
+ baseline=(current bounding box.north)]
\node at (0,0) [state, initial,accepting] (1) {$\mbox{}$};
-\node[state] (2) [right=16mm of 1] {$\mbox{}$};
+\node (2) [right=16mm of 1] {$\mbox{}$};
+\node[state] (4) [above=1mm of 2] {$\mbox{}$};
+\node[state] (5) [below=1mm of 2] {$\mbox{}$};
\node (a) [right=of 2] {$\ldots$};
\node[state] (a1) [right=of a] {$\mbox{}$};
\node[state] (a2) [above=of a1] {$\mbox{}$};
\node[state] (a3) [below=of a1] {$\mbox{}$};
-\path[->] (1) edge node [above] {$\epsilon$} (2);
-\path[->] (a1) edge [bend left=45] node [above] {$\epsilon$} (1);
+\path[->] (1) edge node [below] {$\epsilon$} (4);
+\path[->] (1) edge node [below] {$\epsilon$} (5);
+\path[->] (a1) edge [bend left=45] node [below] {$\epsilon$} (1);
\path[->] (a2) edge [bend right] node [below] {$\epsilon$} (1);
\path[->] (a3) edge [bend left=45] node [below] {$\epsilon$} (1);
\begin{pgfonlayer}{background}
\node (2) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (1) (a2) (a3)] {};
\node [yshift=3mm] at (2.north) {$r^*$};
\end{pgfonlayer}
-\end{tikzpicture}
+\end{tikzpicture}
+\end{tabular}
\end{center}
-\noindent This construction of a NFA from a regular expression
-was invented by Ken Thompson in 1968.
+\noindent
+The corresponding code is in Figure~\ref{thompson2} in Lines 35--43)
+
+To sum up, you can see in the sequence and star cases the need of
+having silent $\epsilon$-transitions. Similarly the alternative case
+shows the need of the NFA-nondeterminism. It seems awkward to form the
+`alternative' composition of two DFAs, because DFA do not allow
+several starting and successor states. All these constructions can now
+be put together in the following recursive function:
+
+
+{\small\begin{lstlisting}[language=Scala]
+def thompson(r: Rexp) : NFAt = r match {
+ case ZERO => NFA_ZERO()
+ case ONE => NFA_ONE()
+ case CHAR(c) => NFA_CHAR(c)
+ case ALT(r1, r2) => NFA_ALT(thompson(r1), thompson(r2))
+ case SEQ(r1, r2) => NFA_SEQ(thompson(r1), thompson(r2))
+ case STAR(r1) => NFA_STAR(thompson(r1))
+}
+\end{lstlisting}}
+
+\noindent
+It calculates a NFA from a regular expressions. At last we can run
+NFAs for the our evil regular expression examples. The graph on the
+left shows that when translating a regular expression such as
+$a^{\{n\}}$ into a NFA, the size can blow up and then even the
+relative fast (on small examples) breadth-first search can be
+slow. The graph on the right shows that with `evil' regular
+expressions the depth-first search can be abysmally slow. Even if the
+graphs not completely overlap with the curves of Python, Ruby and
+Java, they are similar enough. OK\ldots now you know why regular
+expression matchers in those languages are so slow.
-\subsubsection*{Subset Construction}
+\begin{center}
+\begin{tabular}{@{\hspace{-1mm}}c@{\hspace{1mm}}c@{}}
+\begin{tikzpicture}
+\begin{axis}[
+ title={Graph: $a^{?\{n\}} \cdot a^{\{n\}}$ and strings
+ $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
+ title style={yshift=-2ex},
+ xlabel={$n$},
+ x label style={at={(1.05,0.0)}},
+ ylabel={time in secs},
+ enlargelimits=false,
+ xtick={0,5,...,30},
+ xmax=33,
+ ymax=35,
+ ytick={0,5,...,30},
+ scaled ticks=false,
+ axis lines=left,
+ width=5.5cm,
+ height=4cm,
+ legend entries={Python,Ruby, breadth-first NFA},
+ legend style={at={(0.5,-0.25)},anchor=north,font=\small},
+ legend cell align=left]
+ \addplot[blue,mark=*, mark options={fill=white}] table {re-python.data};
+ \addplot[brown,mark=triangle*, mark options={fill=white}] table {re-ruby.data};
+ % breath-first search in NFAs
+ \addplot[red,mark=*, mark options={fill=white}] table {
+ 1 0.00586
+ 2 0.01209
+ 3 0.03076
+ 4 0.08269
+ 5 0.12881
+ 6 0.25146
+ 7 0.51377
+ 8 0.89079
+ 9 1.62802
+ 10 3.05326
+ 11 5.92437
+ 12 11.67863
+ 13 24.00568
+ };
+\end{axis}
+\end{tikzpicture}
+&
+\begin{tikzpicture}
+\begin{axis}[
+ title={Graph: $(a^*)^* \cdot b$ and strings
+ $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
+ title style={yshift=-2ex},
+ xlabel={$n$},
+ x label style={at={(1.05,0.0)}},
+ ylabel={time in secs},
+ enlargelimits=false,
+ xtick={0,5,...,30},
+ xmax=33,
+ ymax=35,
+ ytick={0,5,...,30},
+ scaled ticks=false,
+ axis lines=left,
+ width=5.5cm,
+ height=4cm,
+ legend entries={Python, Java, depth-first NFA},
+ legend style={at={(0.5,-0.25)},anchor=north,font=\small},
+ legend cell align=left]
+ \addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
+ \addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
+ % depth-first search in NFAs
+ \addplot[red,mark=*, mark options={fill=white}] table {
+ 1 0.00605
+ 2 0.03086
+ 3 0.11994
+ 4 0.45389
+ 5 2.06192
+ 6 8.04894
+ 7 32.63549
+ };
+\end{axis}
+\end{tikzpicture}
+\end{tabular}
+\end{center}
-What is interesting is that for every NFA we can find a DFA
-which recognises the same language. This can, for example, be
-done by the \emph{subset construction}. Consider again the NFA
-below on the left, consisting of nodes labeled $0$, $1$ and $2$.
+
+
+\subsection*{Subset Construction}
+
+Of course, some developers of regular expression matchers are aware of
+these problems with sluggish NFAs and try to address them. One common
+technique for alleviating the problem I like to show you in this
+section. This will also explain why we insisted on polymorphic types in
+our DFA code (remember I used \texttt{A} and \texttt{C} for the types
+of states and the input, see Figure~\ref{dfa} on
+Page~\pageref{dfa}).\bigskip
+
+\noindent
+To start remember that we did not bother with defining and
+implementing $\epsilon$NFAs: we immediately translated them into
+equivalent NFAs. Equivalent in the sense of accepting the same
+language (though we only claimed this and did not prove it
+rigorously). Remember also that NFAs have non-deterministic
+transitions defined as a relation or implemented as function returning
+sets of states. This non-determinism is crucial for the Thompson
+Construction to work (recall the cases for $\cdot$, $+$ and
+${}^*$). But this non-determinism makes it harder with NFAs to decide
+when a string is accepted or not; whereas such a decision is rather
+straightforward with DFAs: recall their transition function is a
+\emph{function} that returns a single state. So with DFAs we do not
+have to search at all. What is perhaps interesting is the fact that
+for every NFA we can find a DFA that also recognises the same
+language. This might sound a bit paradoxical: NFA $\rightarrow$
+decision of acceptance hard; DFA $\rightarrow$ decision easy. But this
+\emph{is} true\ldots but of course there is always a caveat---nothing
+ever is for free in life.
+
+There are actually a number of methods for transforming a NFA into
+an equivalent DFA, but the most famous one is the \emph{subset
+ construction}. Consider the following NFA where the states are
+labelled with $0$, $1$ and $2$.
\begin{center}
\begin{tabular}{c@{\hspace{10mm}}c}
\begin{tikzpicture}[scale=0.7,>=stealth',very thick,
every state/.style={minimum size=0pt,
draw=blue!50,very thick,fill=blue!20},
- baseline=0mm]
-\node[state,initial] (q_0) {$0$};
-\node[state] (q_1) [above=of q_0] {$1$};
-\node[state, accepting] (q_2) [below=of q_0] {$2$};
-\path[->] (q_0) edge node [left] {$\epsilon$} (q_1);
-\path[->] (q_0) edge node [left] {$\epsilon$} (q_2);
-\path[->] (q_0) edge [loop right] node {$a$} ();
-\path[->] (q_1) edge [loop above] node {$a$} ();
-\path[->] (q_2) edge [loop below] node {$b$} ();
+ baseline=(current bounding box.center)]
+\node[state,initial] (Q_0) {$0$};
+\node[state] (Q_1) [below=of Q_0] {$1$};
+\node[state, accepting] (Q_2) [below=of Q_1] {$2$};
+
+\path[->] (Q_0) edge node [right] {$b$} (Q_1);
+\path[->] (Q_1) edge node [right] {$a,b$} (Q_2);
+\path[->] (Q_0) edge [loop above] node {$a, b$} ();
\end{tikzpicture}
&
-\begin{tabular}{r|cl}
-nodes & $a$ & $b$\\
+\begin{tabular}{r|ll}
+states & $a$ & $b$\\
\hline
$\{\}\phantom{\star}$ & $\{\}$ & $\{\}$\\
-$\{0\}\phantom{\star}$ & $\{0,1,2\}$ & $\{2\}$\\
-$\{1\}\phantom{\star}$ & $\{1\}$ & $\{\}$\\
-$\{2\}\star$ & $\{\}$ & $\{2\}$\\
-$\{0,1\}\phantom{\star}$ & $\{0,1,2\}$ & $\{2\}$\\
-$\{0,2\}\star$ & $\{0,1,2\}$ & $\{2\}$\\
-$\{1,2\}\star$ & $\{1\}$ & $\{2\}$\\
-s: $\{0,1,2\}\star$ & $\{0,1,2\}$ & $\{2\}$\\
+start: $\{0\}\phantom{\star}$ & $\{0\}$ & $\{0,1\}$\\
+$\{1\}\phantom{\star}$ & $\{2\}$ & $\{2\}$\\
+$\{2\}\star$ & $\{\}$ & $\{\}$\\
+$\{0,1\}\phantom{\star}$ & $\{0,2\}$ & $\{0,1,2\}$\\
+$\{0,2\}\star$ & $\{0\}$ & $\{0,1\}$\\
+$\{1,2\}\star$ & $\{2\}$ & $\{2\}$\\
+$\{0,1,2\}\star$ & $\{0,2\}$ & $\{0,1,2\}$\\
\end{tabular}
\end{tabular}
\end{center}
-\noindent The nodes of the DFA are given by calculating all
-subsets of the set of nodes of the NFA (seen in the nodes
-column on the right). The table shows the transition function
-for the DFA. The first row states that $\{\}$ is the
-sink node which has transitions for $a$ and $b$ to itself.
-The next three lines are calculated as follows:
+\noindent The states of the corresponding DFA are given by generating
+all subsets of the set $\{0,1,2\}$ (seen in the states column
+in the table on the right). The other columns define the transition
+function for the DFA for inputs $a$ and $b$. The first row states that
+$\{\}$ is the sink state which has transitions for $a$ and $b$ to
+itself. The next three lines are calculated as follows:
\begin{itemize}
-\item suppose you calculate the entry for the transition for
- $a$ and the node $\{0\}$
-\item start from the node $0$ in the NFA
-\item do as many $\epsilon$-transition as you can obtaining a
- set of nodes, in this case $\{0,1,2\}$
-\item filter out all notes that do not allow an $a$-transition
- from this set, this excludes $2$ which does not permit a
- $a$-transition
-\item from the remaining set, do as many $\epsilon$-transition
- as you can, this yields again $\{0,1,2\}$
-\item the resulting set specifies the transition from $\{0\}$
- when given an $a$
+\item Suppose you calculate the entry for the $a$-transition for state
+ $\{0\}$. Look for all states in the NFA that can be reached by such
+ a transition from this state; this is only state $0$; therefore from
+ state $\{0\}$ we can go to state $\{0\}$ via an $a$-transition.
+\item Do the same for the $b$-transition; you can reach states $0$ and
+ $1$ in the NFA; therefore in the DFA we can go from state $\{0\}$ to
+ state $\{0,1\}$ via an $b$-transition.
+\item Continue with the states $\{1\}$ and $\{2\}$.
\end{itemize}
-\noindent So the transition from the state $\{0\}$ reading an
-$a$ goes to the state $\{0,1,2\}$. Similarly for the other
-entries in the rows for $\{0\}$, $\{1\}$ and $\{2\}$. The
-other rows are calculated by just taking the union of the
-single node entries. For example for $a$ and $\{0,1\}$ we need
-to take the union of $\{0,1,2\}$ (for $0$) and $\{1\}$ (for
-$1$). The starting state of the DFA can be calculated from the
-starting state of the NFA, that is $0$, and then do as many
-$\epsilon$-transitions as possible. This gives $\{0,1,2\}$
-which is the starting state of the DFA. The terminal states in
-the DFA are given by all sets that contain a $2$, which is the
-terminal state of the NFA. This completes the subset
-construction. So the corresponding DFA to the NFA from
-above is
+\noindent
+Once you filled in the transitions for `simple' states $\{0\}$
+.. $\{2\}$, you only have to build the union for the compound states
+$\{0,1\}$, $\{0,2\}$ and so on. For example for $\{0,1\}$ you take the
+union of Line $\{0\}$ and Line $\{1\}$, which gives $\{0,2\}$ for $a$,
+and $\{0,1,2\}$ for $b$. And so on.
-\begin{center}
-\begin{tikzpicture}[scale=0.7,>=stealth',very thick,
- every state/.style={minimum size=0pt,
- draw=blue!50,very thick,fill=blue!20},
- baseline=0mm]
-\node[state,initial,accepting] (q012) {$0,1,2$};
-\node[state,accepting] (q02) [right=of q012] {$0,2$};
-\node[state] (q01) [above=of q02] {$0,1$};
-\node[state,accepting] (q12) [below=of q02] {$1,2$};
-\node[state] (q0) [right=2cm of q01] {$0$};
-\node[state] (q1) [right=2.5cm of q02] {$1$};
-\node[state,accepting] (q2) [right=1.5cm of q12] {$2$};
-\node[state] (qn) [right=of q1] {$\{\}$};
+The starting state of the DFA can be calculated from the starting
+states of the NFA, that is in this case $\{0\}$. But in general there
+can of course be many starting states in the NFA and you would take
+the corresponding subset as \emph{the} starting state of the DFA.
+
+The accepting states in the DFA are given by all sets that contain a
+$2$, which is the only accpting state in this NFA. But again in
+general if the subset contains any accepting state from the NFA, then
+the corresponding state in the DFA is accepting as well. This
+completes the subset construction. The corresponding DFA for the NFA
+shown above is:
-\path[->] (q012) edge [loop below] node {$a$} ();
-\path[->] (q012) edge node [above] {$b$} (q2);
-\path[->] (q12) edge [bend left] node [below,pos=0.4] {$a$} (q1);
-\path[->] (q12) edge node [below] {$b$} (q2);
-\path[->] (q02) edge node [above] {$a$} (q012);
-\path[->] (q02) edge [bend left] node [above, pos=0.8] {$b$} (q2);
-\path[->] (q01) edge node [below] {$a$} (q012);
-\path[->] (q01) edge [bend left] node [above] {$b$} (q2);
-\path[->] (q0) edge node [below] {$a$} (q012);
-\path[->] (q0) edge node [right, pos=0.2] {$b$} (q2);
-\path[->] (q1) edge [loop above] node {$a$} ();
-\path[->] (q1) edge node [above] {$b$} (qn);
-\path[->] (q2) edge [loop right] node {$b$} ();
-\path[->] (q2) edge node [below] {$a$} (qn);
-\path[->] (qn) edge [loop above] node {$a,b$} ();
-\end{tikzpicture}
-\end{center}
-
-
+\begin{equation}
+\begin{tikzpicture}[scale=0.8,>=stealth',very thick,
+ every state/.style={minimum size=0pt,
+ draw=blue!50,very thick,fill=blue!20},
+ baseline=(current bounding box.center)]
+\node[state,initial] (q0) {$0$};
+\node[state] (q01) [right=of q0] {$0,1$};
+\node[state,accepting] (q02) [below=of q01] {$0,2$};
+\node[state,accepting] (q012) [right=of q02] {$0,1,2$};
+\node[state] (q1) [below=0.5cm of q0] {$1$};
+\node[state,accepting] (q2) [below=1cm of q1] {$2$};
+\node[state] (qn) [below left=1cm of q2] {$\{\}$};
+\node[state,accepting] (q12) [below right=1cm of q2] {$1,2$};
-There are two points to note: One is that very often the
-resulting DFA contains a number of ``dead'' nodes that are
-never reachable from the starting state. For example
-there is no way to reach node $\{0,2\}$ from the starting
-state $\{0,1,2\}$. I let you find the other dead states.
-In effect the DFA in this example is not a minimal DFA. Such
-dead nodes can be safely removed without changing the language
-that is recognised by the DFA. Another point is that in some
-cases, however, the subset construction produces a DFA that
-does \emph{not} contain any dead nodes\ldots{}that means it
-calculates a minimal DFA. Which in turn means that in some
-cases the number of nodes by going from NFAs to DFAs
-exponentially increases, namely by $2^n$ (which is the number
-of subsets you can form for $n$ nodes).
-
-Removing all the dead states in the automaton above,
-gives a much more legible automaton, namely
+\path[->] (q0) edge node [above] {$b$} (q01);
+\path[->] (q01) edge node [above] {$b$} (q012);
+\path[->] (q0) edge [loop above] node {$a$} ();
+\path[->] (q012) edge [loop right] node {$b$} ();
+\path[->] (q012) edge node [below] {$a$} (q02);
+\path[->] (q02) edge node [below] {$a$} (q0);
+\path[->] (q01) edge [bend left] node [left] {$a$} (q02);
+\path[->] (q02) edge [bend left] node [right] {$b$} (q01);
+\path[->] (q1) edge node [left] {$a,b$} (q2);
+\path[->] (q12) edge node [right] {$a, b$} (q2);
+\path[->] (q2) edge node [right] {$a, b$} (qn);
+\path[->] (qn) edge [loop left] node {$a,b$} ();
+\end{tikzpicture}\label{subsetdfa}
+\end{equation}
-\begin{center}
-\begin{tikzpicture}[scale=0.7,>=stealth',very thick,
- every state/.style={minimum size=0pt,
- draw=blue!50,very thick,fill=blue!20},
- baseline=0mm]
-\node[state,initial,accepting] (q012) {$0,1,2$};
-\node[state,accepting] (q2) [right=of q012] {$2$};
-\node[state] (qn) [right=of q2] {$\{\}$};
-
-\path[->] (q012) edge [loop below] node {$a$} ();
-\path[->] (q012) edge node [above] {$b$} (q2);
-\path[->] (q2) edge [loop below] node {$b$} ();
-\path[->] (q2) edge node [below] {$a$} (qn);
-\path[->] (qn) edge [loop above] node {$a,b$} ();
-\end{tikzpicture}
-\end{center}
-
-\noindent Now the big question is whether this DFA
-can recognise the same language as the NFA we started with.
+\noindent
+Please check that this is indeed a DFA. The big question is whether
+this DFA can recognise the same language as the NFA we started with?
I let you ponder about this question.
-\subsubsection*{Brzozowski's Method}
-
-As said before, we can also go into the other direction---from
-DFAs to regular expressions. Brzozowski's method calculates
-a regular expression using familiar transformations for
-solving equational systems. Consider the DFA:
-
-\begin{center}
-\begin{tikzpicture}[scale=1.5,>=stealth',very thick,auto,
- every state/.style={minimum size=0pt,
- inner sep=2pt,draw=blue!50,very thick,
- fill=blue!20}]
- \node[state, initial] (q0) at ( 0,1) {$q_0$};
- \node[state] (q1) at ( 1,1) {$q_1$};
- \node[state, accepting] (q2) at ( 2,1) {$q_2$};
- \path[->] (q0) edge[bend left] node[above] {$a$} (q1)
- (q1) edge[bend left] node[above] {$b$} (q0)
- (q2) edge[bend left=50] node[below] {$b$} (q0)
- (q1) edge node[above] {$a$} (q2)
- (q2) edge [loop right] node {$a$} ()
- (q0) edge [loop below] node {$b$} ();
-\end{tikzpicture}
-\end{center}
-
-\noindent for which we can set up the following equational
-system
-
-\begin{eqnarray}
-q_0 & = & \ONE + q_0\,b + q_1\,b + q_2\,b\\
-q_1 & = & q_0\,a\\
-q_2 & = & q_1\,a + q_2\,a
-\end{eqnarray}
-\noindent There is an equation for each node in the DFA. Let
-us have a look how the right-hand sides of the equations are
-constructed. First have a look at the second equation: the
-left-hand side is $q_1$ and the right-hand side $q_0\,a$. The
-right-hand side is essentially all possible ways how to end up
-in node $q_1$. There is only one incoming edge from $q_0$ consuming
-an $a$. Therefore the right hand side is this
-state followed by character---in this case $q_0\,a$. Now lets
-have a look at the third equation: there are two incoming
-edges for $q_2$. Therefore we have two terms, namely $q_1\,a$ and
-$q_2\,a$. These terms are separated by $+$. The first states
-that if in state $q_1$ consuming an $a$ will bring you to
-$q_2$, and the secont that being in $q_2$ and consuming an $a$
-will make you stay in $q_2$. The right-hand side of the
-first equation is constructed similarly: there are three
-incoming edges, therefore there are three terms. There is
-one exception in that we also ``add'' $\ONE$ to the
-first equation, because it corresponds to the starting state
-in the DFA.
+There are also two points to note: One is that very often in the
+subset construction the resulting DFA contains a number of ``dead''
+states that are never reachable from the starting state. This is
+obvious in the example, where state $\{1\}$, $\{2\}$, $\{1,2\}$ and
+$\{\}$ can never be reached from the starting state. But this might
+not always be as obvious as that. In effect the DFA in this example is
+not a \emph{minimal} DFA (more about this in a minute). Such dead
+states can be safely removed without changing the language that is
+recognised by the DFA. Another point is that in some cases, however,
+the subset construction produces a DFA that does \emph{not} contain
+any dead states\ldots{}this means it calculates a minimal DFA. Which
+in turn means that in some cases the number of states can by going
+from NFAs to DFAs exponentially increase, namely by $2^n$ (which is
+the number of subsets you can form for sets of $n$ states). This blow
+up in the number of states in the DFA is again bad news for how
+quickly you can decide whether a string is accepted by a DFA or
+not. So the caveat with DFAs is that they might make the task of
+finding the next state trival, but might require $2^n$ times as many
+states then a NFA.\bigskip
-Having constructed the equational system, the question is
-how to solve it? Remarkably the rules are very similar to
-solving usual linear equational systems. For example the
-second equation does not contain the variable $q_1$ on the
-right-hand side of the equation. We can therefore eliminate
-$q_1$ from the system by just substituting this equation
-into the other two. This gives
+\noindent
+To conclude this section, how conveniently we can
+implement the subset construction with our versions of NFAs and
+DFAs? Very conveninetly. The code is just:
-\begin{eqnarray}
-q_0 & = & \ONE + q_0\,b + q_0\,a\,b + q_2\,b\\
-q_2 & = & q_0\,a\,a + q_2\,a
-\end{eqnarray}
-
-\noindent where in Equation (4) we have two occurences
-of $q_0$. Like the laws about $+$ and $\cdot$, we can simplify
-Equation (4) to obtain the following two equations:
+{\small\begin{lstlisting}[language=Scala]
+def subset[A, C](nfa: NFA[A, C]) : DFA[Set[A], C] = {
+ DFA(nfa.starts,
+ { case (qs, c) => nfa.nexts(qs, c) },
+ _.exists(nfa.fins))
+}
+\end{lstlisting}}
-\begin{eqnarray}
-q_0 & = & \ONE + q_0\,(b + a\,b) + q_2\,b\\
-q_2 & = & q_0\,a\,a + q_2\,a
-\end{eqnarray}
-
-\noindent Unfortunately we cannot make any more progress with
-substituting equations, because both (6) and (7) contain the
-variable on the left-hand side also on the right-hand side.
-Here we need to now use a law that is different from the usual
-laws about linear equations. It is called \emph{Arden's rule}.
-It states that if an equation is of the form $q = q\,r + s$
-then it can be transformed to $q = s\, r^*$. Since we can
-assume $+$ is symmetric, Equation (7) is of that form: $s$ is
-$q_0\,a\,a$ and $r$ is $a$. That means we can transform
-(7) to obtain the two new equations
-
-\begin{eqnarray}
-q_0 & = & \ONE + q_0\,(b + a\,b) + q_2\,b\\
-q_2 & = & q_0\,a\,a\,(a^*)
-\end{eqnarray}
+\noindent
+The interesting point in this code is that the state type of the
+calculated DFA is \texttt{Set[A]}. Think carefully that this works out
+correctly.
-\noindent Now again we can substitute the second equation into
-the first in order to eliminate the variable $q_2$.
-
-\begin{eqnarray}
-q_0 & = & \ONE + q_0\,(b + a\,b) + q_0\,a\,a\,(a^*)\,b
-\end{eqnarray}
-
-\noindent Pulling $q_0$ out as a single factor gives:
-
-\begin{eqnarray}
-q_0 & = & \ONE + q_0\,(b + a\,b + a\,a\,(a^*)\,b)
-\end{eqnarray}
-
-\noindent This equation is again of the form so that we can
-apply Arden's rule ($r$ is $b + a\,b + a\,a\,(a^*)\,b$ and $s$
-is $\ONE$). This gives as solution for $q_0$ the following
-regular expression:
-
-\begin{eqnarray}
-q_0 & = & \ONE\,(b + a\,b + a\,a\,(a^*)\,b)^*
-\end{eqnarray}
+The DFA is then given by three components: the starting states, the
+transition function and the accepting-states function. The starting
+states are a set in the given NFA, but a single state in the DFA. The
+transition function, given the state \texttt{qs} and input \texttt{c},
+needs to produce the next state: this is the set of all NFA states
+that are reachable from each state in \texttt{qs}. The function
+\texttt{nexts} from the NFA class already calculates this for us. The
+accepting-states function for the DFA is true henevner at least one
+state in the subset is accepting (that is true) in the NFA.\medskip
-\noindent Since this is a regular expression, we can simplify
-away the $\ONE$ to obtain the slightly simpler regular
-expression
-
-\begin{eqnarray}
-q_0 & = & (b + a\,b + a\,a\,(a^*)\,b)^*
-\end{eqnarray}
-
-\noindent
-Now we can unwind this process and obtain the solutions
-for the other equations. This gives:
-
-\begin{eqnarray}
-q_0 & = & (b + a\,b + a\,a\,(a^*)\,b)^*\\
-q_1 & = & (b + a\,b + a\,a\,(a^*)\,b)^*\,a\\
-q_2 & = & (b + a\,b + a\,a\,(a^*)\,b)^*\,a\,a\,(a)^*
-\end{eqnarray}
+\noindent
+You might be able to spend some quality time tinkering with this code
+and time to ponder about it. Then you will probably notice that it is
+actually a bit silly. The whole point of translating the NFA into a
+DFA via the subset construction is to make the decision of whether a
+string is accepted or not faster. Given the code above, the generated
+DFA will be exactly as fast, or as slow, as the NFA we started with
+(actually it will even be a tiny bit slower). The reason is that we
+just re-use the \texttt{nexts} function from the NFA. This function
+implements the non-deterministic breadth-first search. You might be
+thinking: This is cheating! \ldots{} Well, not quite as you will see
+later, but in terms of speed we still need to work a bit in order to
+get sometimes(!) a faster DFA. Let's do this next.
-\noindent Finally, we only need to ``add'' up the equations
-which correspond to a terminal state. In our running example,
-this is just $q_2$. Consequently, a regular expression
-that recognises the same language as the automaton is
-
-\[
-(b + a\,b + a\,a\,(a^*)\,b)^*\,a\,a\,(a)^*
-\]
+\subsection*{DFA Minimisation}
-\noindent You can somewhat crosscheck your solution
-by taking a string the regular expression can match and
-and see whether it can be matched by the automaton.
-One string for example is $aaa$ and \emph{voila} this
-string is also matched by the automaton.
-
-We should prove that Brzozowski's method really produces
-an equivalent regular expression for the automaton. But
-for the purposes of this module, we omit this.
-
-\subsubsection*{Automata Minimization}
-
-As seen in the subset construction, the translation
-of an NFA to a DFA can result in a rather ``inefficient''
-DFA. Meaning there are states that are not needed. A
-DFA can be \emph{minimised} by the following algorithm:
+As seen in \eqref{subsetdfa}, the subset construction from NFA to a
+DFA can result in a rather ``inefficient'' DFA. Meaning there are
+states that are not needed. There are two kinds of such unneeded
+states: \emph{unreachable} states and \emph{non-distinguishable}
+states. The first kind of states can just be removed without affecting
+the language that can be recognised (after all they are
+unreachable). The second kind can also be recognised and thus a DFA
+can be \emph{minimised} by the following algorithm:
\begin{enumerate}
\item Take all pairs $(q, p)$ with $q \not= p$
@@ -693,29 +1128,30 @@
\item All unmarked pairs can be merged.
\end{enumerate}
-\noindent To illustrate this algorithm, consider the following
-DFA.
+\noindent Unfortunately, once we throw away all unreachable states in
+\eqref{subsetdfa}, all remaining states are needed. In order to
+illustrate the minimisation algorithm, consider the following DFA.
\begin{center}
\begin{tikzpicture}[>=stealth',very thick,auto,
every state/.style={minimum size=0pt,
inner sep=2pt,draw=blue!50,very thick,
fill=blue!20}]
-\node[state,initial] (q_0) {$q_0$};
-\node[state] (q_1) [right=of q_0] {$q_1$};
-\node[state] (q_2) [below right=of q_0] {$q_2$};
-\node[state] (q_3) [right=of q_2] {$q_3$};
-\node[state, accepting] (q_4) [right=of q_1] {$q_4$};
-\path[->] (q_0) edge node [above] {$a$} (q_1);
-\path[->] (q_1) edge node [above] {$a$} (q_4);
-\path[->] (q_4) edge [loop right] node {$a, b$} ();
-\path[->] (q_3) edge node [right] {$a$} (q_4);
-\path[->] (q_2) edge node [above] {$a$} (q_3);
-\path[->] (q_1) edge node [right] {$b$} (q_2);
-\path[->] (q_0) edge node [above] {$b$} (q_2);
-\path[->] (q_2) edge [loop left] node {$b$} ();
-\path[->] (q_3) edge [bend left=95, looseness=1.3] node
- [below] {$b$} (q_0);
+\node[state,initial] (Q_0) {$Q_0$};
+\node[state] (Q_1) [right=of Q_0] {$Q_1$};
+\node[state] (Q_2) [below right=of Q_0] {$Q_2$};
+\node[state] (Q_3) [right=of Q_2] {$Q_3$};
+\node[state, accepting] (Q_4) [right=of Q_1] {$Q_4$};
+\path[->] (Q_0) edge node [above] {$a$} (Q_1);
+\path[->] (Q_1) edge node [above] {$a$} (Q_4);
+\path[->] (Q_4) edge [loop right] node {$a, b$} ();
+\path[->] (Q_3) edge node [right] {$a$} (Q_4);
+\path[->] (Q_2) edge node [above] {$a$} (Q_3);
+\path[->] (Q_1) edge node [right] {$b$} (Q_2);
+\path[->] (Q_0) edge node [above] {$b$} (Q_2);
+\path[->] (Q_2) edge [loop left] node {$b$} ();
+\path[->] (Q_3) edge [bend left=95, looseness=1.3] node
+ [below] {$b$} (Q_0);
\end{tikzpicture}
\end{center}
@@ -736,15 +1172,15 @@
\draw (3,0) -- (3, 2);
\draw (4,0) -- (4, 1);
-\draw (0.5,-0.5) node {$q_0$};
-\draw (1.5,-0.5) node {$q_1$};
-\draw (2.5,-0.5) node {$q_2$};
-\draw (3.5,-0.5) node {$q_3$};
+\draw (0.5,-0.5) node {$Q_0$};
+\draw (1.5,-0.5) node {$Q_1$};
+\draw (2.5,-0.5) node {$Q_2$};
+\draw (3.5,-0.5) node {$Q_3$};
-\draw (-0.5, 3.5) node {$q_1$};
-\draw (-0.5, 2.5) node {$q_2$};
-\draw (-0.5, 1.5) node {$q_3$};
-\draw (-0.5, 0.5) node {$q_4$};
+\draw (-0.5, 3.5) node {$Q_1$};
+\draw (-0.5, 2.5) node {$Q_2$};
+\draw (-0.5, 1.5) node {$Q_3$};
+\draw (-0.5, 0.5) node {$Q_4$};
\draw (0.5,0.5) node {\large$\star$};
\draw (1.5,0.5) node {\large$\star$};
@@ -755,10 +1191,10 @@
\noindent where the lower row is filled with stars, because in
the corresponding pairs there is always one state that is
-accepting ($q_4$) and a state that is non-accepting (the other
+accepting ($Q_4$) and a state that is non-accepting (the other
states).
-Now in Step 3 we need to fill in more stars according whether
+In Step 3 we need to fill in more stars according whether
one of the next-state pairs are marked. We have to do this
for every unmarked field until there is no change anymore.
This gives the triangle
@@ -777,15 +1213,15 @@
\draw (3,0) -- (3, 2);
\draw (4,0) -- (4, 1);
-\draw (0.5,-0.5) node {$q_0$};
-\draw (1.5,-0.5) node {$q_1$};
-\draw (2.5,-0.5) node {$q_2$};
-\draw (3.5,-0.5) node {$q_3$};
+\draw (0.5,-0.5) node {$Q_0$};
+\draw (1.5,-0.5) node {$Q_1$};
+\draw (2.5,-0.5) node {$Q_2$};
+\draw (3.5,-0.5) node {$Q_3$};
-\draw (-0.5, 3.5) node {$q_1$};
-\draw (-0.5, 2.5) node {$q_2$};
-\draw (-0.5, 1.5) node {$q_3$};
-\draw (-0.5, 0.5) node {$q_4$};
+\draw (-0.5, 3.5) node {$Q_1$};
+\draw (-0.5, 2.5) node {$Q_2$};
+\draw (-0.5, 1.5) node {$Q_3$};
+\draw (-0.5, 0.5) node {$Q_4$};
\draw (0.5,0.5) node {\large$\star$};
\draw (1.5,0.5) node {\large$\star$};
@@ -798,27 +1234,202 @@
\end{tikzpicture}
\end{center}
-\noindent which means states $q_0$ and $q_2$, as well as $q_1$
-and $q_3$ can be merged. This gives the following minimal DFA
+\noindent which means states $Q_0$ and $Q_2$, as well as $Q_1$
+and $Q_3$ can be merged. This gives the following minimal DFA
\begin{center}
\begin{tikzpicture}[>=stealth',very thick,auto,
every state/.style={minimum size=0pt,
inner sep=2pt,draw=blue!50,very thick,
fill=blue!20}]
-\node[state,initial] (q_02) {$q_{0, 2}$};
-\node[state] (q_13) [right=of q_02] {$q_{1, 3}$};
-\node[state, accepting] (q_4) [right=of q_13]
- {$q_{4\phantom{,0}}$};
-\path[->] (q_02) edge [bend left] node [above] {$a$} (q_13);
-\path[->] (q_13) edge [bend left] node [below] {$b$} (q_02);
-\path[->] (q_02) edge [loop below] node {$b$} ();
-\path[->] (q_13) edge node [above] {$a$} (q_4);
-\path[->] (q_4) edge [loop above] node {$a, b$} ();
+\node[state,initial] (Q_02) {$Q_{0, 2}$};
+\node[state] (Q_13) [right=of Q_02] {$Q_{1, 3}$};
+\node[state, accepting] (Q_4) [right=of Q_13]
+ {$Q_{4\phantom{,0}}$};
+\path[->] (Q_02) edge [bend left] node [above] {$a$} (Q_13);
+\path[->] (Q_13) edge [bend left] node [below] {$b$} (Q_02);
+\path[->] (Q_02) edge [loop below] node {$b$} ();
+\path[->] (Q_13) edge node [above] {$a$} (Q_4);
+\path[->] (Q_4) edge [loop above] node {$a, b$} ();
+\end{tikzpicture}
+\end{center}
+
+
+By the way, we are not bothering with implementing the above
+minimisation algorith: while up to now all the transformations used
+some clever composition of functions, the minimisation algorithm
+cannot be implemented by just composing some functions. For this we
+would require a more concrete representation of the transition
+function (like maps). If we did this, however, then many advantages of
+the functions would be thrown away. So the compromise is to not being
+able to minimise (easily) our DFAs.
+
+\subsection*{Brzozowski's Method}
+
+I know this handout is already a long, long rant: but after all it is
+a topic that has been researched for more than 60 years. If you
+reflect on what you have read so far, the story is that you can take a
+regular expression, translate it via the Thompson Construction into an
+$\epsilon$NFA, then translate it into a NFA by removing all
+$\epsilon$-transitions, and then via the subset construction obtain a
+DFA. In all steps we made sure the language, or which strings can be
+recognised, stays the same. Of couse we should have proved this in
+each step, but let us cut corners here. After the last section, we
+can even minimise the DFA (maybe not in code). But again we made sure
+the same language is recognised. You might be wondering: Can we go
+into the other direction? Can we go from a DFA and obtain a regular
+expression that can recognise the same language as the DFA?\medskip
+
+\noindent
+The answer is yes. Again there are several methods for calculating a
+regular expression for a DFA. I will show you Brzozowski's method
+because it calculates a regular expression using quite familiar
+transformations for solving equational systems. Consider the DFA:
+
+\begin{center}
+\begin{tikzpicture}[scale=1.5,>=stealth',very thick,auto,
+ every state/.style={minimum size=0pt,
+ inner sep=2pt,draw=blue!50,very thick,
+ fill=blue!20}]
+ \node[state, initial] (q0) at ( 0,1) {$Q_0$};
+ \node[state] (q1) at ( 1,1) {$Q_1$};
+ \node[state, accepting] (q2) at ( 2,1) {$Q_2$};
+ \path[->] (q0) edge[bend left] node[above] {$a$} (q1)
+ (q1) edge[bend left] node[above] {$b$} (q0)
+ (q2) edge[bend left=50] node[below] {$b$} (q0)
+ (q1) edge node[above] {$a$} (q2)
+ (q2) edge [loop right] node {$a$} ()
+ (q0) edge [loop below] node {$b$} ();
\end{tikzpicture}
\end{center}
-\subsubsection*{Regular Languages}
+\noindent for which we can set up the following equational
+system
+
+\begin{eqnarray}
+Q_0 & = & \ONE + Q_0\,b + Q_1\,b + Q_2\,b\\
+Q_1 & = & Q_0\,a\\
+Q_2 & = & Q_1\,a + Q_2\,a
+\end{eqnarray}
+
+\noindent There is an equation for each node in the DFA. Let
+us have a look how the right-hand sides of the equations are
+constructed. First have a look at the second equation: the
+left-hand side is $Q_1$ and the right-hand side $Q_0\,a$. The
+right-hand side is essentially all possible ways how to end up
+in node $Q_1$. There is only one incoming edge from $Q_0$ consuming
+an $a$. Therefore the right hand side is this
+state followed by character---in this case $Q_0\,a$. Now lets
+have a look at the third equation: there are two incoming
+edges for $Q_2$. Therefore we have two terms, namely $Q_1\,a$ and
+$Q_2\,a$. These terms are separated by $+$. The first states
+that if in state $Q_1$ consuming an $a$ will bring you to
+$Q_2$, and the second that being in $Q_2$ and consuming an $a$
+will make you stay in $Q_2$. The right-hand side of the
+first equation is constructed similarly: there are three
+incoming edges, therefore there are three terms. There is
+one exception in that we also ``add'' $\ONE$ to the
+first equation, because it corresponds to the starting state
+in the DFA.
+
+Having constructed the equational system, the question is
+how to solve it? Remarkably the rules are very similar to
+solving usual linear equational systems. For example the
+second equation does not contain the variable $Q_1$ on the
+right-hand side of the equation. We can therefore eliminate
+$Q_1$ from the system by just substituting this equation
+into the other two. This gives
+
+\begin{eqnarray}
+Q_0 & = & \ONE + Q_0\,b + Q_0\,a\,b + Q_2\,b\\
+Q_2 & = & Q_0\,a\,a + Q_2\,a
+\end{eqnarray}
+
+\noindent where in Equation (4) we have two occurrences
+of $Q_0$. Like the laws about $+$ and $\cdot$, we can simplify
+Equation (4) to obtain the following two equations:
+
+\begin{eqnarray}
+Q_0 & = & \ONE + Q_0\,(b + a\,b) + Q_2\,b\\
+Q_2 & = & Q_0\,a\,a + Q_2\,a
+\end{eqnarray}
+
+\noindent Unfortunately we cannot make any more progress with
+substituting equations, because both (6) and (7) contain the
+variable on the left-hand side also on the right-hand side.
+Here we need to now use a law that is different from the usual
+laws about linear equations. It is called \emph{Arden's rule}.
+It states that if an equation is of the form $q = q\,r + s$
+then it can be transformed to $q = s\, r^*$. Since we can
+assume $+$ is symmetric, Equation (7) is of that form: $s$ is
+$Q_0\,a\,a$ and $r$ is $a$. That means we can transform
+(7) to obtain the two new equations
+
+\begin{eqnarray}
+Q_0 & = & \ONE + Q_0\,(b + a\,b) + Q_2\,b\\
+Q_2 & = & Q_0\,a\,a\,(a^*)
+\end{eqnarray}
+
+\noindent Now again we can substitute the second equation into
+the first in order to eliminate the variable $Q_2$.
+
+\begin{eqnarray}
+Q_0 & = & \ONE + Q_0\,(b + a\,b) + Q_0\,a\,a\,(a^*)\,b
+\end{eqnarray}
+
+\noindent Pulling $Q_0$ out as a single factor gives:
+
+\begin{eqnarray}
+Q_0 & = & \ONE + Q_0\,(b + a\,b + a\,a\,(a^*)\,b)
+\end{eqnarray}
+
+\noindent This equation is again of the form so that we can
+apply Arden's rule ($r$ is $b + a\,b + a\,a\,(a^*)\,b$ and $s$
+is $\ONE$). This gives as solution for $Q_0$ the following
+regular expression:
+
+\begin{eqnarray}
+Q_0 & = & \ONE\,(b + a\,b + a\,a\,(a^*)\,b)^*
+\end{eqnarray}
+
+\noindent Since this is a regular expression, we can simplify
+away the $\ONE$ to obtain the slightly simpler regular
+expression
+
+\begin{eqnarray}
+Q_0 & = & (b + a\,b + a\,a\,(a^*)\,b)^*
+\end{eqnarray}
+
+\noindent
+Now we can unwind this process and obtain the solutions
+for the other equations. This gives:
+
+\begin{eqnarray}
+Q_0 & = & (b + a\,b + a\,a\,(a^*)\,b)^*\\
+Q_1 & = & (b + a\,b + a\,a\,(a^*)\,b)^*\,a\\
+Q_2 & = & (b + a\,b + a\,a\,(a^*)\,b)^*\,a\,a\,(a)^*
+\end{eqnarray}
+
+\noindent Finally, we only need to ``add'' up the equations
+which correspond to a terminal state. In our running example,
+this is just $Q_2$. Consequently, a regular expression
+that recognises the same language as the DFA is
+
+\[
+(b + a\,b + a\,a\,(a^*)\,b)^*\,a\,a\,(a)^*
+\]
+
+\noindent You can somewhat crosscheck your solution by taking a string
+the regular expression can match and and see whether it can be matched
+by the DFA. One string for example is $aaa$ and \emph{voila} this
+string is also matched by the automaton.
+
+We should prove that Brzozowski's method really produces an equivalent
+regular expression. But for the purposes of this module, we omit
+this. I guess you are relieved.
+
+
+\subsection*{Regular Languages}
Given the constructions in the previous sections we obtain
the following overall picture:
@@ -844,7 +1455,7 @@
there exists a regular expression that can recognise the same
language. Again we did not prove this fact.
-The interesting conclusion is that automata and regular
+The fundamental conclusion we can draw is that automata and regular
expressions can recognise the same set of languages:
\begin{quote} A language is \emph{regular} iff there exists a
@@ -857,76 +1468,109 @@
automaton that recognises all its strings.
\end{quote}
-\noindent So for deciding whether a string is recognised by a
-regular expression, we could use our algorithm based on
-derivatives or NFAs or DFAs. But let us quickly look at what
-the differences mean in computational terms. Translating a
-regular expression into a NFA gives us an automaton that has
-$O(n)$ nodes---that means the size of the NFA grows linearly
-with the size of the regular expression. The problem with NFAs
-is that the problem of deciding whether a string is accepted
-or not is computationally not cheap. Remember with NFAs we
-have potentially many next states even for the same input and
-also have the silent $\epsilon$-transitions. If we want to
-find a path from the starting state of an NFA to an accepting
-state, we need to consider all possibilities. In Ruby and
-Python this is done by a depth-first search, which in turn
-means that if a ``wrong'' choice is made, the algorithm has to
-backtrack and thus explore all potential candidates. This is
-exactly the reason why Ruby and Python are so slow for evil
-regular expressions. An alternative to the potentially slow
-depth-first search is to explore the search space in a
-breadth-first fashion, but this might incur a big memory
-penalty.
+\noindent Note that this is not a stement for a particular language
+(that is a particular set of strings), but about a large class of
+languages, namely the regular ones.
-To avoid the problems with NFAs, we can translate them
-into DFAs. With DFAs the problem of deciding whether a
-string is recognised or not is much simpler, because in
-each state it is completely determined what the next
-state will be for a given input. So no search is needed.
-The problem with this is that the translation to DFAs
-can explode exponentially the number of states. Therefore when
-this route is taken, we definitely need to minimise the
-resulting DFAs in order to have an acceptable memory
-and runtime behaviour. But remember the subset construction
-in the worst case explodes the number of states by $2^n$.
-Effectively also the translation to DFAs can incur a big
+As a consequence for deciding whether a string is recognised by a
+regular expression, we could use our algorithm based on derivatives or
+NFAs or DFAs. But let us quickly look at what the differences mean in
+computational terms. Translating a regular expression into a NFA gives
+us an automaton that has $O(n)$ states---that means the size of the
+NFA grows linearly with the size of the regular expression. The
+problem with NFAs is that the problem of deciding whether a string is
+accepted or not is computationally not cheap. Remember with NFAs we
+have potentially many next states even for the same input and also
+have the silent $\epsilon$-transitions. If we want to find a path from
+the starting state of a NFA to an accepting state, we need to consider
+all possibilities. In Ruby, Python and Java this is done by a
+depth-first search, which in turn means that if a ``wrong'' choice is
+made, the algorithm has to backtrack and thus explore all potential
+candidates. This is exactly the reason why Ruby, Python and Java are
+so slow for evil regular expressions. An alternative to the
+potentially slow depth-first search is to explore the search space in
+a breadth-first fashion, but this might incur a big memory penalty.
+
+To avoid the problems with NFAs, we can translate them into DFAs. With
+DFAs the problem of deciding whether a string is recognised or not is
+much simpler, because in each state it is completely determined what
+the next state will be for a given input. So no search is needed. The
+problem with this is that the translation to DFAs can explode
+exponentially the number of states. Therefore when this route is
+taken, we definitely need to minimise the resulting DFAs in order to
+have an acceptable memory and runtime behaviour. But remember the
+subset construction in the worst case explodes the number of states by
+$2^n$. Effectively also the translation to DFAs can incur a big
runtime penalty.
-But this does not mean that everything is bad with automata.
-Recall the problem of finding a regular expressions for the
-language that is \emph{not} recognised by a regular
-expression. In our implementation we added explicitly such a
-regular expressions because they are useful for recognising
-comments. But in principle we did not need to. The argument
-for this is as follows: take a regular expression, translate
+But this does not mean that everything is bad with automata. Recall
+the problem of finding a regular expressions for the language that is
+\emph{not} recognised by a regular expression. In our implementation
+we added explicitly such a regular expressions because they are useful
+for recognising comments. But in principle we did not need to. The
+argument for this is as follows: take a regular expression, translate
it into a NFA and then a DFA that both recognise the same
-language. Once you have the DFA it is very easy to construct
-the automaton for the language not recognised by an DFA. If
-the DFA is completed (this is important!), then you just need
-to exchange the accepting and non-accepting states. You can
-then translate this DFA back into a regular expression and
-that will be the regular expression that can match all strings
-the original regular expression could \emph{not} match.
+language. Once you have the DFA it is very easy to construct the
+automaton for the language not recognised by a DFA. If the DFA is
+completed (this is important!), then you just need to exchange the
+accepting and non-accepting states. You can then translate this DFA
+back into a regular expression and that will be the regular expression
+that can match all strings the original regular expression could
+\emph{not} match.
-It is also interesting that not all languages are regular. The
-most well-known example of a language that is not regular
-consists of all the strings of the form
+It is also interesting that not all languages are regular. The most
+well-known example of a language that is not regular consists of all
+the strings of the form
\[a^n\,b^n\]
-\noindent meaning strings that have the same number of $a$s
-and $b$s. You can try, but you cannot find a regular
-expression for this language and also not an automaton. One
-can actually prove that there is no regular expression nor
-automaton for this language, but again that would lead us too
-far afield for what we want to do in this module.
+\noindent meaning strings that have the same number of $a$s and
+$b$s. You can try, but you cannot find a regular expression for this
+language and also not an automaton. One can actually prove that there
+is no regular expression nor automaton for this language, but again
+that would lead us too far afield for what we want to do in this
+module.
+
+
+\subsection*{Where Have Derivatives Gone?}
+
+By now you are probably fed up with this text. It is now way too long
+for one lecture, but there is still one aspect of the
+automata-regular-expression-connection I like to describe. Perhaps by
+now you are asking yourself: Where have the derivatives gone? Did we
+just forget them? Well, they have a place in the picture of
+calculating a DFA from the regular expression.
+
+To be done
-\section*{Further Reading}
+\begin{center}
+\begin{tikzpicture}
+ [level distance=25mm,very thick,auto,
+ level 1/.style={sibling distance=30mm},
+ level 2/.style={sibling distance=15mm},
+ every node/.style={minimum size=30pt,
+ inner sep=0pt,circle,draw=blue!50,very thick,
+ fill=blue!20}]
+ \node {$r$} [grow=right]
+ child[->] {node (cn) {$d_{c_n}$}
+ child { node {$dd_{c_nc_n}$}}
+ child { node {$dd_{c_nc_1}$}}
+ %edge from parent node[left] {$c_n$}
+ }
+ child[->] {node (c1) {$d_{c_1}$}
+ child { node {$dd_{c_1c_n}$}}
+ child { node {$dd_{c_1c_1}$}}
+ %edge from parent node[left] {$c_1$}
+ };
+ %%\draw (cn) -- (c1) node {\vdots};
+\end{tikzpicture}
+\end{center}
-Compare what a ``human expert'' would create as an automaton for the
-regular expression $a (b + c)^*$ and what the Thomson
-algorithm generates.
+%\section*{Further Reading}
+
+%Compare what a ``human expert'' would create as an automaton for the
+%regular expression $a\cdot (b + c)^*$ and what the Thomson
+%algorithm generates.
%http://www.inf.ed.ac.uk/teaching/courses/ct/
\end{document}
@@ -935,3 +1579,5 @@
%%% mode: latex
%%% TeX-master: t
%%% End:
+
+
Binary file handouts/ho04.pdf has changed
--- a/handouts/ho04.tex Tue Sep 26 14:07:29 2017 +0100
+++ b/handouts/ho04.tex Tue Sep 26 14:08:49 2017 +0100
@@ -91,11 +91,13 @@
letters, while values just start with a single upper-case
character and the rest are lower-case letters.
-{\small\lstinputlisting[language=Scala,numbers=none]
+{\small\lstinputlisting[language=Scala,numbers=none,linebackgroundcolor=
+ {\ifodd\value{lstnumber}\color{capri!3}\fi}]
{../progs/app01.scala}}
-{\small\lstinputlisting[language=Scala,numbers=none]
+{\small\lstinputlisting[language=Scala,numbers=none,linebackgroundcolor=
+ {\ifodd\value{lstnumber}\color{capri!3}\fi}]
{../progs/app02.scala}}
@@ -580,7 +582,9 @@
\end{figure}
\begin{figure}[p]
-\lstinputlisting{../progs/app61.scala}
+\small
+\lstinputlisting[numbers=left,linebackgroundcolor=
+ {\ifodd\value{lstnumber}\color{capri!3}\fi}]{../progs/app61.scala}
\caption{The Scala code for the simplification function. The
first part defines some auxillary functions for the rectification.
@@ -639,7 +643,8 @@
be regarded as a regular expression. The extended definition
in Scala therefore looks as follows:
-{\small\lstinputlisting[language=Scala]
+{\small\lstinputlisting[language=Scala, numbers=none,linebackgroundcolor=
+ {\ifodd\value{lstnumber}\color{capri!3}\fi}]
{../progs/app03.scala}}
\noindent Since we regard records as regular expressions we
@@ -648,7 +653,8 @@
to extend the definition of values, which in Scala looks as
follows:
-{\small\lstinputlisting[language=Scala]
+{\small\lstinputlisting[language=Scala, numbers=none,linebackgroundcolor=
+ {\ifodd\value{lstnumber}\color{capri!3}\fi}]
{../progs/app04.scala}}
\noindent Let us now look at the purpose of records more
Binary file handouts/ho05.pdf has changed
Binary file handouts/ho06.pdf has changed
Binary file handouts/ho07.pdf has changed
Binary file handouts/ho08.pdf has changed
--- a/handouts/ho08.tex Tue Sep 26 14:07:29 2017 +0100
+++ b/handouts/ho08.tex Tue Sep 26 14:08:49 2017 +0100
@@ -8,6 +8,10 @@
% modern compilers are different
% https://channel9.msdn.com/Blogs/Seth-Juarez/Anders-Hejlsberg-on-Modern-Compiler-Construction
+%halting problem
+%https://dfilaretti.github.io/2017-04-30/halting-problem
+
+
\begin{document}
\lstset{morekeywords={def,if,then,else,write,read},keywordstyle=\color{codepurple}\bfseries}
Binary file handouts/notation.pdf has changed
--- a/handouts/notation.tex Tue Sep 26 14:07:29 2017 +0100
+++ b/handouts/notation.tex Tue Sep 26 14:08:49 2017 +0100
@@ -8,15 +8,14 @@
\section*{A Crash-Course on Notation}
-There are innumerable books available about compilers,
-automata theory and formal languages. Unfortunately, they
-often use their own notational conventions and their own
-symbols. This handout is meant to clarify some of the notation
-I will use. I appologise in advance that sometimes I will be a
-bit fuzzy\ldots the problem is that often we want to have
-convenience in our mathematical definitions (to make them
-readable and understandable), but sometimes we need precision
-for actual programs.
+There are innumerable books available about compilers, automata theory
+and formal languages. Unfortunately, they often use their own
+notational conventions and their own symbols. This handout is meant to
+clarify some of the notation I will use. I apologise in advance that
+sometimes I will be a bit fuzzy\ldots the problem is that often we
+want to have convenience in our mathematical definitions (to make them
+readable and understandable), but other times we need pedantic
+precision for actual programs.
\subsubsection*{Characters and Strings}
@@ -55,7 +54,7 @@
Why do I make this distinction? Because we often need to
define functions using variables ranging over characters. We
need to somehow say this is a variable, say $c$, ranging over
-characters, while this is the atual character \pcode{c}.
+characters, while this is the actual character \pcode{c}.
An \defn{alphabet} is a (non-empty) finite set of characters.
Often the letter $\Sigma$ is used to refer to an alphabet. For
@@ -70,14 +69,14 @@
\defn{Strings} are lists of characters. Unfortunately, there
are many ways how we can write down strings. In programming
-languages, they are usually written as \dq{$hello$} where the
+languages, they are usually written as \dq{\texttt{hello}} where the
double quotes indicate that we are dealing with a string. In
-programming languages, such as Scala, strings have a special
+typed programming languages, such as Scala, strings have a special
type---namely \pcode{String} which is different from the type
-for lists of chatacters. This is because strings can be
-efficiently represented in memory, unlike general lists. Since
-\code{String} and the type of lists of characters,
-\code{List[Char]} are not the same, we need to explicitly
+for lists of characters. This is because strings can be
+efficiently represented in memory, unlike lists. Since
+\code{String} and the type of lists of characters
+(\code{List[Char]}) are not the same, we need to explicitly
coerce elements between the two types, for example
\begin{lstlisting}[numbers=none]
@@ -85,9 +84,11 @@
res01: List[Char] = List(a, b, c)
\end{lstlisting}
-\noindent Since in our (mathematical) definitions we regard
-strings as lists of characters, we will also write
-\dq{$hello$} as
+\noindent
+However, we do not want to do this kind of explicit coercion in our
+pencil-and-paper, everyday arguments. So in our (mathematical)
+definitions we regard strings as lists of characters, we will also
+write \dq{$hello$} as
\[
[\text{\it h, e, l, l, o}] \qquad\text{or simply}\qquad \textit{hello}
@@ -107,7 +108,7 @@
which we write as $s_1 @ s_2$. If we regard $s_1$ and $s_2$ as
lists of characters, then $@$ is the list-append function.
Suppose we are given two strings \dq{\textit{foo}} and
-\dq{\textit{bar}}, then their concatenation, writen
+\dq{\textit{bar}}, then their concatenation, written
\dq{\textit{foo}} $@$ \dq{\textit{bar}}, gives
\dq{\textit{foobar}}. But as said above, we will often
simplify our life and just drop the double quotes whenever it
@@ -115,12 +116,10 @@
write \textit{foo}, \textit{bar}, \textit{foobar} or
\textit{foo $@$ bar}.
-Occasionally we will use the notation $a^n$ for strings, which
-stands for the string of $n$ repeated $a$s. So $a^{n}b^{n}$ is
-a string that has as many $a$s by as many $b$s.
-
-A simple property of string concatenation is
-\emph{associativity}, meaning
+Occasionally we will use the notation $a^n$ for strings, which stands
+for the string of $n$ repeated $a$s. So $a^{n}b^{n}$ is a string that
+has some number of $a$s followed by the same number of $b$s. A simple
+property of string concatenation is \emph{associativity}, meaning
\[(s_1 @ s_2) @ s_3 = s_1 @ (s_2 @ s_3)\]
@@ -142,15 +141,20 @@
\{1, 2, 3\}
\]
-\noindent The notation $\in$ means \emph{element of}, so $1
-\in \{1, 2, 3\}$ is true and $4 \in \{1, 2, 3\}$ is false.
-Sets can potentially have infinitely many elements. For
-example the set of all natural numbers $\{0, 1, 2, \ldots\}$
-is infinite. This set is often also abbreviated as
-$\mathbb{N}$. We can define sets by giving all elements, for
-example $\{0, 1\}$ for the set containing just $0$ and $1$,
-but also by \defn{set comprehensions}. For example the set of
-all even natural numbers can be defined as
+\noindent The notation $\in$ means \emph{element of}, so $1 \in \{1,
+2, 3\}$ is true and $4 \in \{1, 2, 3\}$ is false. Note that the
+\emph{list} $[1, 2, 3]$ is something different from the \emph{set}
+$\{1, 2, 3\}$: in the former we care about the order and potentially
+several occurrences of a number; while with the latter we do not.
+Also sets can potentially have infinitely many elements, whereas lists
+cannot. For example
+the set of all natural numbers $\{0, 1, 2, \ldots\}$ is infinite. This
+set is often also abbreviated as $\mathbb{N}$. Lists can be very large, but they cannot contain infinitely many elements.
+
+We can define sets by giving all elements, for example $\{0, 1\}$ for
+the set containing just $0$ and $1$, but also by \defn{set
+ comprehensions}. For example the set of all even natural numbers can
+be defined as
\[
\{n\;|\;n\in\mathbb{N} \wedge n\;\text{is even}\}
@@ -163,8 +167,9 @@
\{n\;|\; n^2 < 9 \wedge n \in \mathbb{N}\}
\]
-\noindent Notice that set comprehensions could be used
-to define set union, intersection and difference:
+\noindent Can you see why this defines the set $\{0, 1, 2\}$? Notice
+that set comprehensions could be used to define set union,
+intersection and difference:
\begin{eqnarray*}
A \cup B & \dn & \{x\;|\; x \in A \vee x \in B\}\\
@@ -201,13 +206,30 @@
\noindent but using the big union notation is more concise.
-While this stuff about sete might all look trivial or even
-needlessly pedantic, \emph{Nature} is never simple. If you
-want to be amazed how complicated sets can get, watch out for
-the last lecture just before Christmas where I want you to
-convince you of the fact that some sets are more infinite than
-others. Actually that will be a fact that is very relevant to
-the material of this module.
+As an aside: While this stuff about sets might all look trivial or even needlessly
+pedantic, \emph{Nature} is never simple. If you want to be amazed how
+complicated sets can get, watch out for the last lecture just before
+Christmas where I want to convince you of the fact that some sets are
+more infinite than others. Yes, you read correctly, there can be sets
+that are ``more infinite'' then others. If you think this is obvious:
+say you have the infinite set $\mathbb{N}\backslash\{0\} = \{1, 2, 3, 4, \ldots\}$ which is all
+the natural numbers except $0$, and then compare it to the set
+$\{0, 1, 2, 3, 4, \ldots\}$ which contains the $0$. If you think,
+the second must be more infinite\ldots{} well, then think again. Because the two
+infinite sets
+
+\begin{center}
+ $\{1, 2, 3, 4, \ldots\}$ and
+ $\{0, 1, 2, 3, 4, \ldots\}$
+\end{center}
+
+\noindent
+contain actually the same number of elements. Does this make sense?
+Though this might all look strange this about infinite sets will be a
+topic that is very relevant to the material of this module. It tells
+us what we can compute with a computer (actually algorithm) and what
+we cannot. But during the first 9 lectures we can go by without this
+``weird'' stuff.
Another important notion in this module are \defn{languages}, which
are sets of strings. One of the main goals for us will be how to
@@ -254,7 +276,7 @@
set behaves like $0$ for multiplication and the set $\{[]\}$
like $1$ for multiplication ($n * 1 = n$ and $n * 0 = 0$).
-Following the language concatenation, we can define a
+Using the operation of language concatenation, we can define a
\defn{language power} operation as follows:
\begin{eqnarray*}
@@ -267,7 +289,7 @@
set, but the set containing the empty string. So no matter
what the set $A$ is, $A^0$ will always be $\{[]\}$. (There is
another hint about a connection between the $@$-operation and
-multiplication: How is $x^n$ defined recursively and what is
+multiplication: How is $x^n$ defined in mathematics and what is
$x^0$?)
Next we can define the \defn{star operation} for languages:
@@ -282,14 +304,14 @@
gives
\[
-A^0 \cup A^1 \cup A^2 \cup A^3 \cup \ldots
+A\star \dn A^0 \cup A^1 \cup A^2 \cup A^3 \cup \ldots
\]
\noindent
which is equal to
\[
-\{[]\} \,\cup\, A \,\cup\, A @ A \,\cup\, A @ A @ A \,\cup\, \ldots
+A\star \dn \{[]\} \,\cup\, A \,\cup\, A @ A \,\cup\, A @ A @ A \,\cup\, \ldots
\]
\noindent We can see that the empty string is always in $A\star$,
@@ -299,7 +321,7 @@
Recall that an alphabet is often referred to by the letter
$\Sigma$. We can now write for the set of \emph{all} strings
-over this alphabet $\Sigma\star$. In doing so we also include the
+over this alphabet as $\Sigma\star$. In doing so we also include the
empty string as a possible string over $\Sigma$. So if $\Sigma
= \{a, b\}$, then $\Sigma\star$ is
Binary file handouts/scala-ho.pdf has changed
--- a/handouts/scala-ho.tex Tue Sep 26 14:07:29 2017 +0100
+++ b/handouts/scala-ho.tex Tue Sep 26 14:08:49 2017 +0100
@@ -982,6 +982,12 @@
inherited from the JVM that can be really annoying. For
example a fixed stack size. One can work around this
particular limitation, but why does one have to?
+More such `puzzles' can be found at
+
+\begin{center}
+ \url{http://scalapuzzlers.com} and
+ \url{http://latkin.org/blog/2017/05/02/when-the-scala-compiler-doesnt-help/}
+\end{center}
Even if Scala has been a success in several high-profile
companies, there is also a company (Yammer) that first used
Binary file hws/hw01.pdf has changed
--- a/hws/hw01.tex Tue Sep 26 14:07:29 2017 +0100
+++ b/hws/hw01.tex Tue Sep 26 14:08:49 2017 +0100
@@ -29,10 +29,11 @@
\item Read the handout of the first lecture and the handout
about notation. Make sure you understand the concepts of
- strings and languages. In the context of the AFL-course,
+ strings and languages. In the context of the CFL-course,
what is meant by the term \emph{language}?
-\item Give the definition for regular expressions. What is the
+ \item Give the definition for regular expressions---this is an
+ inductive datatype. What is the
meaning of a regular expression? (Hint: The meaning is
defined recursively.)
@@ -41,6 +42,7 @@
\emph{concatenating} two sets of strings. This operation
is also written as $\_ \,@\, \_$. According to
this definition, what is $A \,@\, \{\}$ equal to?
+ Is in general $A\,@\,B$ equal to $B\,@\,A$?
\item Assume a set $A$ contains 4 strings and a set $B$
contains 7 strings. None of the strings is the empty
Binary file hws/hw02.pdf has changed
--- a/hws/hw02.tex Tue Sep 26 14:07:29 2017 +0100
+++ b/hws/hw02.tex Tue Sep 26 14:08:49 2017 +0100
@@ -8,7 +8,9 @@
\HEADER
\begin{enumerate}
-
+\item What is the difference between \emph{basic} regular expressions
+ and \emph{extended} regular expressions?
+
\item What is the language recognised by the regular
expressions $(\ZERO^*)^*$.
Binary file hws/hw03.pdf has changed
Binary file hws/hw04.pdf has changed
--- a/hws/hw04.tex Tue Sep 26 14:07:29 2017 +0100
+++ b/hws/hw04.tex Tue Sep 26 14:08:49 2017 +0100
@@ -97,7 +97,13 @@
\item What is the purpose of the record regular expression in
the Sulzmann \& Lu algorithm?
-
+
+\item Recall the functions \textit{nullable} and \textit{zeroable}.
+ Define recursive functions \textit{atmostempty} (for regular expressions
+ that match no string or only the empty string), \textit{somechars} (for regular
+ expressions that match some non-empty string), \textit{infinitestrings} (for regular
+ expressions that can match infinitely many strings).
+
%\item (Optional) The tokenizer in \texttt{regexp3.scala} takes as
%argument a string and a list of rules. The result is a list of tokens. Improve this tokenizer so
Binary file hws/hw05.pdf has changed
Binary file hws/hw06.pdf has changed
Binary file hws/hw07.pdf has changed
Binary file hws/hw08.pdf has changed
Binary file hws/hw09.pdf has changed
Binary file hws/proof.pdf has changed
--- a/langs.sty Tue Sep 26 14:07:29 2017 +0100
+++ b/langs.sty Tue Sep 26 14:08:49 2017 +0100
@@ -39,13 +39,27 @@
otherkeywords={=,!=,:=,<,>,\%;*,/},
}[keywords,comments,strings]
+
+\newcommand{\code}[1]{{\lstinline{#1}}}
+\newcommand{\pcode}[1]{\mbox{\lstset{language={},keywordstyle=\color{black}}\lstinline!#1!}}
+\newcommand{\scode}[1]{\mbox{\lstset{language={},basicstyle=\ttfamily\color{codegreen}}\lstinline!#1!}}
+\makeatother
+
+%%\lstset{escapeinside={(*@}{@*)}}
+\lstset{escapeinside={/*@}{@*/}}
+
+%% stripy code
+\usepackage{lstlinebgrd}
+\definecolor{capri}{rgb}{0.0, 0.75, 1.0}
+
+
\lstdefinestyle{mystyle}
{basicstyle=\ttfamily,
keywordstyle=\color{codepurple}\bfseries,
stringstyle=\color{codegreen},
commentstyle=\color{codegreen},
morecomment=[s][\color{codedocblue}]{/**}{*/},
- numbers=left,
+ numbers=none,
numberstyle=\tiny\color{black},
stepnumber=1,
numbersep=10pt,
@@ -54,17 +68,9 @@
showstringspaces=false,
xleftmargin=8mm,
emphstyle=\color{codeblue}\bfseries,
- keepspaces
+ keepspaces,
+ linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}
}
\lstset{language=Scala,
style=mystyle}
-
-
-\newcommand{\code}[1]{{\lstinline{#1}}}
-\newcommand{\pcode}[1]{\mbox{\lstset{language={},keywordstyle=\color{black}}\lstinline!#1!}}
-\newcommand{\scode}[1]{\mbox{\lstset{language={},basicstyle=\ttfamily\color{codegreen}}\lstinline!#1!}}
-\makeatother
-
-%%\lstset{escapeinside={(*@}{@*)}}
-\lstset{escapeinside={/*@}{@*/}}
--- a/progs/Matcher.thy Tue Sep 26 14:07:29 2017 +0100
+++ b/progs/Matcher.thy Tue Sep 26 14:08:49 2017 +0100
@@ -6,8 +6,8 @@
section {* Regular Expressions *}
datatype rexp =
- NULL
-| EMPTY
+ ZERO
+| ONE
| CHAR char
| SEQ rexp rexp
| ALT rexp rexp
@@ -63,8 +63,8 @@
fun
L :: "rexp \<Rightarrow> string set"
where
- "L (NULL) = {}"
-| "L (EMPTY) = {[]}"
+ "L (ZERO) = {}"
+| "L (ONE) = {[]}"
| "L (CHAR c) = {[c]}"
| "L (SEQ r1 r2) = (L r1) ;; (L r2)"
| "L (ALT r1 r2) = (L r1) \<union> (L r2)"
@@ -75,180 +75,61 @@
fun
nullable :: "rexp \<Rightarrow> bool"
where
- "nullable (NULL) = False"
-| "nullable (EMPTY) = True"
+ "nullable (ZERO) = False"
+| "nullable (ONE) = True"
| "nullable (CHAR c) = False"
| "nullable (ALT r1 r2) = (nullable r1 \<or> nullable r2)"
| "nullable (SEQ r1 r2) = (nullable r1 \<and> nullable r2)"
| "nullable (STAR r) = True"
-fun
- noccurs :: "rexp \<Rightarrow> bool"
-where
- "noccurs (NULL) = True"
-| "noccurs (EMPTY) = False"
-| "noccurs (CHAR c) = False"
-| "noccurs (ALT r1 r2) = (noccurs r1 \<or> noccurs r2)"
-| "noccurs (SEQ r1 r2) = (noccurs r1 \<or> noccurs r2)"
-| "noccurs (STAR r) = (noccurs r)"
-lemma
- "\<not> noccurs r \<Longrightarrow> L r \<noteq> {}"
-apply(induct r)
-apply(auto simp add: Seq_def)
-done
-
-lemma
- "L r = {} \<Longrightarrow> noccurs r"
-apply(induct r)
-apply(auto simp add: Seq_def)
-done
-
-lemma does_not_hold:
- "noccurs r \<Longrightarrow> L r = {}"
-apply(induct r)
-apply(auto simp add: Seq_def)
-oops
-
-fun
- der :: "char \<Rightarrow> rexp \<Rightarrow> rexp"
-where
- "der c (NULL) = NULL"
-| "der c (EMPTY) = NULL"
-| "der c (CHAR c') = (if c = c' then EMPTY else NULL)"
-| "der c (ALT r1 r2) = ALT (der c r1) (der c r2)"
-| "der c (SEQ r1 r2) = ALT (SEQ (der c r1) r2) (if nullable r1 then der c r2 else NULL)"
-| "der c (STAR r) = SEQ (der c r) (STAR r)"
-
-fun
- ders :: "string \<Rightarrow> rexp \<Rightarrow> rexp"
-where
- "ders [] r = r"
-| "ders (c # s) r = ders s (der c r)"
-
-fun
- matcher :: "rexp \<Rightarrow> string \<Rightarrow> bool"
-where
- "matcher r s = nullable (ders s r)"
-
-
-section {* Correctness Proof of the Matcher *}
+section {* Correctness Proof for Nullable *}
lemma nullable_correctness:
shows "nullable r \<longleftrightarrow> [] \<in> (L r)"
-by (induct r) (auto simp add: Seq_def)
-section {* Left-Quotient of a Set *}
-
-fun
- zeroable :: "rexp \<Rightarrow> bool"
-where
- "zeroable (NULL) = True"
-| "zeroable (EMPTY) = False"
-| "zeroable (CHAR c) = False"
-| "zeroable (ALT r1 r2) = (zeroable r1 \<and> zeroable r2)"
-| "zeroable (SEQ r1 r2) = (zeroable r1 \<or> zeroable r2)"
-| "zeroable (STAR r) = False"
-
-
-lemma zeroable_correctness:
- shows "zeroable r \<longleftrightarrow> (L r = {})"
apply(induct r)
-apply(auto simp add: Seq_def)[6]
-done
-
-section {* Left-Quotient of a Set *}
-
-definition
- Der :: "char \<Rightarrow> string set \<Rightarrow> string set"
-where
- "Der c A \<equiv> {s. [c] @ s \<in> A}"
-
-lemma Der_null [simp]:
- shows "Der c {} = {}"
-unfolding Der_def
-by auto
-
-lemma Der_empty [simp]:
- shows "Der c {[]} = {}"
-unfolding Der_def
-by auto
-
-lemma Der_char [simp]:
- shows "Der c {[d]} = (if c = d then {[]} else {})"
-unfolding Der_def
-by auto
-
-lemma Der_union [simp]:
- shows "Der c (A \<union> B) = Der c A \<union> Der c B"
-unfolding Der_def
-by auto
-
-lemma Der_seq [simp]:
- shows "Der c (A ;; B) = (Der c A) ;; B \<union> (if [] \<in> A then Der c B else {})"
-unfolding Der_def Seq_def
-by (auto simp add: Cons_eq_append_conv)
+(* ZERO case *)
+apply(simp only: nullable.simps)
+apply(simp only: L.simps)
+apply(simp)
+(* ONE case *)
+apply(simp only: nullable.simps)
+apply(simp only: L.simps)
+apply(simp)
+(* CHAR case *)
+apply(simp only: nullable.simps)
+apply(simp only: L.simps)
+apply(simp)
+prefer 2
+(* ALT case *)
+apply(simp (no_asm) only: nullable.simps)
+apply(simp only:)
+apply(simp only: L.simps)
+apply(simp)
+(* SEQ case *)
+oops
-lemma Der_star [simp]:
- shows "Der c (A\<star>) = (Der c A) ;; A\<star>"
-proof -
- have "Der c (A\<star>) = Der c ({[]} \<union> A ;; A\<star>)"
- by (simp only: star_cases[symmetric])
- also have "... = Der c (A ;; A\<star>)"
- by (simp only: Der_union Der_empty) (simp)
- also have "... = (Der c A) ;; A\<star> \<union> (if [] \<in> A then Der c (A\<star>) else {})"
- by simp
- also have "... = (Der c A) ;; A\<star>"
- unfolding Seq_def Der_def
- by (auto dest: star_decomp)
- finally show "Der c (A\<star>) = (Der c A) ;; A\<star>" .
-qed
-
-
-lemma der_correctness:
- shows "L (der c r) = Der c (L r)"
-by (induct r)
- (simp_all add: nullable_correctness)
-
-lemma matcher_correctness:
- shows "matcher r s \<longleftrightarrow> s \<in> L r"
-by (induct s arbitrary: r)
- (simp_all add: nullable_correctness der_correctness Der_def)
-
-section {* Examples *}
-
-definition
- "CHRA \<equiv> CHAR (CHR ''a'')"
+lemma nullable_correctness:
+ shows "nullable r \<longleftrightarrow> [] \<in> (L r)"
+apply(induct r)
+apply(simp_all)
+(* all easy subgoals are proved except the last 2 *)
+(* where the definition of Seq needs to be unfolded. *)
+oops
-definition
- "ALT1 \<equiv> ALT CHRA EMPTY"
-
-definition
- "SEQ3 \<equiv> SEQ (SEQ ALT1 ALT1) ALT1"
-
-value "matcher SEQ3 ''aaa''"
+lemma nullable_correctness:
+ shows "nullable r \<longleftrightarrow> [] \<in> (L r)"
+apply(induct r)
+apply(simp_all add: Seq_def)
+(* except the star case every thing is proved *)
+(* we need to use the rule for Star.start *)
+oops
-value "matcher NULL []"
-value "matcher (CHAR (CHR ''a'')) [CHR ''a'']"
-value "matcher (CHAR a) [a,a]"
-value "matcher (STAR (CHAR a)) []"
-value "matcher (STAR (CHAR a)) [a,a]"
-value "matcher (SEQ (CHAR (CHR ''a'')) (SEQ (STAR (CHAR (CHR ''b''))) (CHAR (CHR ''c'')))) ''abbbbc''"
-value "matcher (SEQ (CHAR (CHR ''a'')) (SEQ (STAR (CHAR (CHR ''b''))) (CHAR (CHR ''c'')))) ''abbcbbc''"
-
-section {* Incorrect Matcher - fun-definition rejected *}
-
-fun
- match :: "rexp list \<Rightarrow> string \<Rightarrow> bool"
-where
- "match [] [] = True"
-| "match [] (c # s) = False"
-| "match (NULL # rs) s = False"
-| "match (EMPTY # rs) s = match rs s"
-| "match (CHAR c # rs) [] = False"
-| "match (CHAR c # rs) (d # s) = (if c = d then match rs s else False)"
-| "match (ALT r1 r2 # rs) s = (match (r1 # rs) s \<or> match (r2 # rs) s)"
-| "match (SEQ r1 r2 # rs) s = match (r1 # r2 # rs) s"
-| "match (STAR r # rs) s = (match rs s \<or> match (r # (STAR r) # rs) s)"
+lemma nullable_correctness:
+ shows "nullable r \<longleftrightarrow> [] \<in> (L r)"
+apply(induct r)
+apply(simp_all add: Seq_def Star.start)
+done
end
\ No newline at end of file
--- a/progs/catastrophic.java Tue Sep 26 14:07:29 2017 +0100
+++ b/progs/catastrophic.java Tue Sep 26 14:08:49 2017 +0100
@@ -8,7 +8,7 @@
public class catastrophic {
public static void main(String[] args) {
- //we always run all the tests twice -> warmup of the JVM
+ //we always run all the tests twice -> to warmup of the JVM
for (int runs = 0; runs < 2; runs++) {
Pattern pattern = Pattern.compile("(a*)*b");
@@ -25,7 +25,8 @@
for (int i = 0; i < 2; i++) {
pattern.matcher(input).find();
}
-
+
+ // Print out time
System.out.println(length + " " + input + ": "
+ ((System.nanoTime() - start) / 2000000000d)
+ "s");
--- a/progs/catastrophic.rb Tue Sep 26 14:07:29 2017 +0100
+++ b/progs/catastrophic.rb Tue Sep 26 14:08:49 2017 +0100
@@ -2,18 +2,19 @@
nums = (1..1000)
-#iterate through the nums 1-100
+#iterate through the nums 1-1000
nums.each do |i|
start_time = Time.now
string = "a" * i
+
+ #create a new regular expression based on current value of i
re_str = "a?" * i + "+" + "a" * i
- #create a new regular expression based on current value of i
-
re = Regexp.new(re_str)
re.match(string)
- #if re.match(string)
+
+ #if re.match(string)
# puts "matched string a * #{i} with regex #{re}"
#else
# puts "unmatched string a * #{i} with regex #{re}"
--- a/progs/cw1.scala Tue Sep 26 14:07:29 2017 +0100
+++ b/progs/cw1.scala Tue Sep 26 14:08:49 2017 +0100
@@ -1,137 +1,90 @@
+// CW 1
import scala.language.implicitConversions
import scala.language.reflectiveCalls
-abstract class Rexp {
- def simp : Rexp = this
-}
+abstract class Rexp
+case object ZERO extends Rexp // matches nothing
+case object ONE extends Rexp // matches the empty string
+case class CHAR(c: Char) extends Rexp // matches a character c
+case class ALT(r1: Rexp, r2: Rexp) extends Rexp // alternative
+case class SEQ(r1: Rexp, r2: Rexp) extends Rexp // sequence
+case class STAR(r: Rexp) extends Rexp // star
+case class RANGE(cs: Set[Char]) extends Rexp // set of characters
+case class PLUS(r: Rexp) extends Rexp // plus
+case class OPT(r: Rexp) extends Rexp // optional
+case class NTIMES(r: Rexp, n: Int) extends Rexp // n-times
+case class UPNTIMES(r: Rexp, n: Int) extends Rexp // up n-times
+case class FROMNTIMES(r: Rexp, n: Int) extends Rexp // from n-times
+case class NMTIMES(r: Rexp, n: Int, m: Int) extends Rexp // between nm-times
+case class NOT(r: Rexp) extends Rexp // not
+case class CFUN(f: Char => Boolean) extends Rexp // subsuming CHAR and RANGE
-case object NULL extends Rexp
-case object EMPTY extends Rexp
-case class CHAR(c: Char) extends Rexp
-case class ALT(r1: Rexp, r2: Rexp) extends Rexp {
- override def toString = r1.toString + " | " + r2.toString
- override def simp = (r1.simp, r2.simp) match {
- case (NULL, r) => r
- case (r, NULL) => r
- case (r, EMPTY) => if (nullable(r)) r else ALT(r, EMPTY)
- case (EMPTY, r) => if (nullable(r)) r else ALT(r, EMPTY)
- case (r1, r2) => if (r1 == r2) r1 else ALT(r1, r2)
- }
-}
-case class RANGE(cs: List[Char]) extends Rexp {
- //override def toString = cs.mkString("[",",","]")
- override def toString = "[" + cs.head + " .. " + cs.last + "]"
-}
-object RANGE {
- def apply(s: String) : RANGE = RANGE(s.toList)
-}
-case class SEQ(r1: Rexp, r2: Rexp) extends Rexp {
- override def toString = r1.toString + " ~ " + r2.toString
- override def simp = (r1.simp, r2.simp) match {
- case (NULL, _) => NULL
- case (_, NULL) => NULL
- case (EMPTY, r) => r
- case (r, EMPTY) => r
- case (r1, r2) => SEQ(r1, r2)
- }
-}
-case class PLUS(r: Rexp) extends Rexp {
- override def simp = r.simp match {
- case NULL => EMPTY
- case EMPTY => EMPTY
- case r => PLUS(r)
- }
-}
-case class STAR(r: Rexp) extends Rexp {
- override def simp = r.simp match {
- case NULL => EMPTY
- case EMPTY => EMPTY
- case r => STAR(r)
- }
-}
-case class NTIMES(r: Rexp, n: Int) extends Rexp {
- override def simp = if (n == 0) EMPTY else
- r.simp match {
- case NULL => NULL
- case EMPTY => EMPTY
- case r => NTIMES(r, n)
- }
-}
-case class NUPTOM(r: Rexp, n: Int, m: Int) extends Rexp {
- override def simp = if (m == 0) EMPTY else
- r.simp match {
- case NULL => NULL
- case EMPTY => EMPTY
- case r => NUPTOM(r, n, m)
- }
-}
-def NMTIMES(r: Rexp, n: Int, m: Int) = {
- if(m < n) throw new IllegalArgumentException("the number m cannot be smaller than n.")
- else NUPTOM(r, n, m - n)
-}
-
-
-case class NOT(r: Rexp) extends Rexp {
- override def simp = NOT(r.simp)
-}
-case class OPT(r: Rexp) extends Rexp {
- override def simp = OPT(r.simp)
-}
// nullable function: tests whether the regular
// expression can recognise the empty string
def nullable (r: Rexp) : Boolean = r match {
- case NULL => false
- case EMPTY => true
+ case ZERO => false
+ case ONE => true
case CHAR(_) => false
case ALT(r1, r2) => nullable(r1) || nullable(r2)
case SEQ(r1, r2) => nullable(r1) && nullable(r2)
case STAR(_) => true
+ case RANGE(_) => false
case PLUS(r) => nullable(r)
- case NTIMES(r, i) => if (i == 0) true else nullable(r)
- case NUPTOM(r, i, j) => if (i == 0) true else nullable(r)
- case RANGE(_) => false
- case NOT(r) => !(nullable(r))
case OPT(_) => true
-}
-
-def zeroable (r: Rexp) : Boolean = r match {
- case NULL => true
- case EMPTY => false
- case CHAR(_) => false
- case ALT(r1, r2) => zeroable(r1) && zeroable(r2)
- case SEQ(r1, r2) => zeroable(r1) || zeroable(r2)
- case STAR(_) => false
- case PLUS(r) => zeroable(r)
- case NTIMES(r, i) => if (i == 0) false else zeroable(r)
- case NUPTOM(r, i, j) => if (i == 0) false else zeroable(r)
- case RANGE(_) => false
- case NOT(r) => !(zeroable(r))
- case OPT(_) => false
+ case NTIMES(r, n) => if (n == 0) true else nullable(r)
+ case UPNTIMES(_, _) => true
+ case FROMNTIMES(r, n) => if (n == 0) true else nullable(r)
+ case NMTIMES(r, n, m) => if (n == 0) true else nullable(r)
+ case NOT(r) => !(nullable(r))
+ case CFUN(_) => false
}
// derivative of a regular expression w.r.t. a character
def der (c: Char, r: Rexp) : Rexp = r match {
- case NULL => NULL
- case EMPTY => NULL
- case CHAR(d) => if (c == d) EMPTY else NULL
+ case ZERO => ZERO
+ case ONE => ZERO
+ case CHAR(d) => if (c == d) ONE else ZERO
case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
case SEQ(r1, r2) =>
if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
else SEQ(der(c, r1), r2)
case STAR(r) => SEQ(der(c, r), STAR(r))
+ case RANGE(cs) => if (cs contains c) ONE else ZERO
case PLUS(r) => SEQ(der(c, r), STAR(r))
- case NTIMES(r, i) =>
- if (i == 0) NULL else der(c, SEQ(r, NTIMES(r, i - 1)))
- case NUPTOM(r, i, j) =>
- if (i == 0 && j == 0) NULL else
- if (i == 0) ALT(der(c, NTIMES(r, j)), der(c, NUPTOM(r, 0, j - 1)))
- else der(c, SEQ(r, NUPTOM(r, i - 1, j)))
- case RANGE(cs) => if (cs contains c) EMPTY else NULL
+ case OPT(r) => der(c, r)
+ case NTIMES(r, n) =>
+ if (n == 0) ZERO else SEQ(der(c, r), NTIMES(r, n - 1))
+ case UPNTIMES(r, n) =>
+ if (n == 0) ZERO else SEQ(der(c, r), UPNTIMES(r, n - 1))
+ case FROMNTIMES(r, n) =>
+ if (n == 0) ZERO else SEQ(der(c, r), FROMNTIMES(r, n - 1))
+ case NMTIMES(r, n, m) =>
+ if (m < n) ZERO else
+ if (n == 0 && m == 0) ZERO else
+ if (n == 0) SEQ(der(c, r), UPNTIMES(r, m - 1))
+ else SEQ(der(c, r), NMTIMES(r, n - 1, m - 1))
case NOT(r) => NOT(der (c, r))
- case OPT(r) => der(c, r)
+ case CFUN(f) => if (f(c)) ONE else ZERO
}
+def simp(r: Rexp) : Rexp = r match {
+ case ALT(r1, r2) => (simp(r1), simp(r2)) match {
+ case (ZERO, r2s) => r2s
+ case (r1s, ZERO) => r1s
+ case (r1s, r2s) => if (r1s == r2s) r1s else ALT (r1s, r2s)
+ }
+ case SEQ(r1, r2) => (simp(r1), simp(r2)) match {
+ case (ZERO, _) => ZERO
+ case (_, ZERO) => ZERO
+ case (ONE, r2s) => r2s
+ case (r1s, ONE) => r1s
+ case (r1s, r2s) => SEQ(r1s, r2s)
+ }
+ case r => r
+}
+
+
// derivative w.r.t. a string (iterates der)
def ders (s: List[Char], r: Rexp) : Rexp = s match {
case Nil => r
@@ -140,15 +93,18 @@
def ders_simp (s: List[Char], r: Rexp) : Rexp = s match {
case Nil => r
- case c::s => ders_simp(s, der(c, r).simp)
+ case c::s => ders_simp(s, simp(der(c, r)))
}
-// main matcher function
-def matcher(r: Rexp, s: String) : Boolean = nullable(ders_simp(s.toList, r))
+// main matcher function including simplification
+def matcher(r: Rexp, s: String) : Boolean =
+ nullable(ders_simp(s.toList, r))
+
+
// some convenience for typing in regular expressions
def charlist2rexp(s : List[Char]) : Rexp = s match {
- case Nil => EMPTY
+ case Nil => ONE
case c::Nil => CHAR(c)
case c::s => SEQ(CHAR(c), charlist2rexp(s))
}
@@ -168,12 +124,29 @@
def ~ (r: String) = SEQ(s, r)
}
+val r1 = NTIMES("a", 3)
+val r2 = NTIMES(OPT("a"), 3)
+val r3 = UPNTIMES("a", 3)
+val r4 = UPNTIMES(OPT("a"), 3)
+val r5 = NMTIMES("a", 3, 5)
+val r6 = NMTIMES(OPT("a"), 3, 5)
+
+val rs = List(r1, r2, r3, r4, r5, r6)
+
+rs.map(matcher(_, ""))
+rs.map(matcher(_, "a"))
+rs.map(matcher(_, "aa"))
+rs.map(matcher(_, "aaa"))
+rs.map(matcher(_, "aaaa"))
+rs.map(matcher(_, "aaaaa"))
+rs.map(matcher(_, "aaaaaa"))
+
println("EMAIL:")
-val LOWERCASE = "abcdefghijklmnopqrstuvwxyz"
-val DIGITS = "0123456789"
-val EMAIL = PLUS(RANGE(LOWERCASE + DIGITS + "_" + "." + "-")) ~ "@" ~
- PLUS(RANGE(LOWERCASE + DIGITS + "." + "-")) ~ "." ~
- NMTIMES(RANGE(LOWERCASE + "."), 2, 6)
+val LOWERCASE = ('a' to 'z').toSet
+val DIGITS = ('0' to '9').toSet
+val EMAIL = { PLUS(CFUN(LOWERCASE | DIGITS | ("_.-").toSet)) ~ "@" ~
+ PLUS(CFUN(LOWERCASE | DIGITS | (".-").toSet)) ~ "." ~
+ NMTIMES(CFUN(LOWERCASE | Set('.')), 2, 6) }
val my_email = "christian.urban@kcl.ac.uk"
@@ -182,11 +155,11 @@
println(ders_simp(my_email.toList, EMAIL))
println("COMMENTS:")
-val ALL = RANGE(LOWERCASE + " ")
+val ALL = CFUN((c:Char) => true)
val COMMENT = "/*" ~ (NOT(ALL.% ~ "*/" ~ ALL.%)) ~ "*/"
println(matcher(COMMENT, "/**/"))
-println(matcher(COMMENT, "/*foobar_comment*/"))
+println(matcher(COMMENT, "/*foobar*/"))
println(matcher(COMMENT, "/*test*/test*/"))
println(matcher(COMMENT, "/*test/*test*/"))
@@ -209,7 +182,7 @@
println(matcher(EVIL2, "aaa" * 45 + "a"))
println("TEST for bug pointed out by Filips Ivanovs")
-val test = NMTIMES(RANGE(LOWERCASE + "."), 2, 6)
+val test = NMTIMES(CFUN(LOWERCASE | Set('.')), 2, 6)
println(matcher(test,"a"))
println(matcher(test,"ab"))
@@ -232,7 +205,6 @@
println("Rexp |" + s + "|")
println("Derivative:\n" + ders_simp(s.toList, r))
println("Is Nullable: " + nullable(ders_simp(s.toList, r)))
- println("Is Zeroable: " + zeroable(ders_simp(s.toList, r)))
Console.readLine
}
--- a/progs/dfa.scala Tue Sep 26 14:07:29 2017 +0100
+++ b/progs/dfa.scala Tue Sep 26 14:08:49 2017 +0100
@@ -1,64 +1,65 @@
-// DFAs
-import scala.util._
+// DFAs in Scala using partial functions
+import scala.util.Try
+
+// type abbreviation for partial functions
+type :=>[A, B] = PartialFunction[A, B]
-abstract class State
-type States = Set[State]
+case class DFA[A, C](start: A, // starting state
+ delta: (A, C) :=> A, // transition (partial fun)
+ fins: A => Boolean) { // final states
-case class IntState(i: Int) extends State
+ def deltas(q: A, s: List[C]) : A = s match {
+ case Nil => q
+ case c::cs => deltas(delta(q, c), cs)
+ }
+
+ def accepts(s: List[C]) : Boolean =
+ Try(fins(deltas(start, s))) getOrElse false
+}
-object NewState {
- var counter = 0
-
- def apply() : IntState = {
- counter += 1;
- new IntState(counter - 1)
- }
-}
+// the example shown in the handout
+abstract class State
+case object Q0 extends State
+case object Q1 extends State
+case object Q2 extends State
+case object Q3 extends State
+case object Q4 extends State
+
+val delta : (State, Char) :=> State =
+ { case (Q0, 'a') => Q1
+ case (Q0, 'b') => Q2
+ case (Q1, 'a') => Q4
+ case (Q1, 'b') => Q2
+ case (Q2, 'a') => Q3
+ case (Q2, 'b') => Q2
+ case (Q3, 'a') => Q4
+ case (Q3, 'b') => Q0
+ case (Q4, 'a') => Q4
+ case (Q4, 'b') => Q4 }
+
+val dfa = DFA(Q0, delta, Set[State](Q4))
+
+dfa.accepts("bbabaab".toList) // true
+dfa.accepts("baba".toList) // false
-// DFA class
-case class DFA(states: States,
- start: State,
- delta: (State, Char) => State,
- fins: States) {
-
- def deltas(q: State, s: List[Char]) : State = s match {
- case Nil => q
- case c::cs => deltas(delta(q, c), cs)
- }
-
- // wether a string is accepted by the automaton or not
- def accepts(s: String) : Boolean =
- Try(fins contains
- (deltas(start, s.toList))) getOrElse false
-}
-
+// another DFA test with a Sink state
+abstract class S
+case object S0 extends S
+case object S1 extends S
+case object S2 extends S
+case object Sink extends S
-// example DFA from the lectures
-val Q0 = NewState()
-val Q1 = NewState()
-val Q2 = NewState()
-val Q3 = NewState()
-val Q4 = NewState()
-
+// transition function with a sink state
+val sigma : (S, Char) :=> S =
+ { case (S0, 'a') => S1
+ case (S1, 'a') => S2
+ case _ => Sink
+ }
-val delta : (State, Char) => State = {
- case (Q0, 'a') => Q1
- case (Q0, 'b') => Q2
- case (Q1, 'a') => Q4
- case (Q1, 'b') => Q2
- case (Q2, 'a') => Q3
- case (Q2, 'b') => Q2
- case (Q3, 'a') => Q4
- case (Q3, 'b') => Q0
- case (Q4, 'a') => Q4
- case (Q4, 'b') => Q4
-}
+val dfa1a = DFA(S0, sigma, Set[S](S2))
-val DFA1 = DFA(Set(Q0, Q1, Q2, Q3, Q4), Q0, delta, Set(Q4))
+dfa1a.accepts("aa".toList) // true
+dfa1a.accepts("".toList) // false
+dfa1a.accepts("ab".toList) // false
-println(DFA1.accepts("aaa"))
-println(DFA1.accepts("bb"))
-println(DFA1.accepts("aaac"))
-
-
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/progs/display/dfa.scala Tue Sep 26 14:08:49 2017 +0100
@@ -0,0 +1,43 @@
+// DFAs in Scala using partial functions
+import scala.util.Try
+
+// type abbreviation for partial functions
+type :=>[A, B] = PartialFunction[A, B]
+
+case class DFA[A, C](start: A, // starting state
+ delta: (A, C) :=> A, // transition (partial fun)
+ fins: A => Boolean) { // final states
+
+ def deltas(q: A, s: List[C]) : A = s match {
+ case Nil => q
+ case c::cs => deltas(delta(q, c), cs)
+ }
+
+ def accepts(s: List[C]) : Boolean =
+ Try(fins(deltas(start, s))) getOrElse false
+}
+
+// the example shown earlier in the handout
+abstract class State
+case object Q0 extends State
+case object Q1 extends State
+case object Q2 extends State
+case object Q3 extends State
+case object Q4 extends State
+
+val delta : (State, Char) :=> State =
+ { case (Q0, 'a') => Q1
+ case (Q0, 'b') => Q2
+ case (Q1, 'a') => Q4
+ case (Q1, 'b') => Q2
+ case (Q2, 'a') => Q3
+ case (Q2, 'b') => Q2
+ case (Q3, 'a') => Q4
+ case (Q3, 'b') => Q0
+ case (Q4, 'a') => Q4
+ case (Q4, 'b') => Q4 }
+
+val dfa = DFA(Q0, delta, Set[State](Q4))
+
+dfa.accepts("bbabaab".toList) // true
+dfa.accepts("baba".toList) // false
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/progs/display/enfa.scala Tue Sep 26 14:08:49 2017 +0100
@@ -0,0 +1,39 @@
+// epsilon NFAs...immediately translated into NFAs
+// (needs dfa.scala and nfa.scala)
+
+// fixpoint construction
+import scala.annotation.tailrec
+@tailrec
+def fixpT[A](f: A => A, x: A): A = {
+ val fx = f(x)
+ if (fx == x) x else fixpT(f, fx)
+}
+
+// translates eNFAs directly into NFAs
+def eNFA[A, C](starts: Set[A], // starting states
+ delta: (A, Option[C]) :=> Set[A], // epsilon-transitions
+ fins: A => Boolean) : NFA[A, C] = { // final states
+
+ // epsilon transitions
+ def enext(q: A) : Set[A] =
+ applyOrElse(delta, (q, None))
+
+ def enexts(qs: Set[A]) : Set[A] =
+ qs | qs.flatMap(enext(_)) // | is the set-union in Scala
+
+ // epsilon closure
+ def ecl(qs: Set[A]) : Set[A] =
+ fixpT(enexts, qs)
+
+ // "normal" transitions
+ def next(q: A, c: C) : Set[A] =
+ applyOrElse(delta, (q, Some(c)))
+
+ def nexts(qs: Set[A], c: C) : Set[A] =
+ ecl(ecl(qs).flatMap(next(_, c)))
+
+ // result NFA
+ NFA(ecl(starts),
+ { case (q, c) => nexts(Set(q), c) },
+ q => ecl(Set(q)) exists fins)
+}
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/progs/display/nfa.scala Tue Sep 26 14:08:49 2017 +0100
@@ -0,0 +1,38 @@
+// NFAs in Scala using partial functions (returning
+// sets of states)
+//
+// needs :load dfa.scala for states
+
+
+// type abbreviation for partial functions
+type :=>[A, B] = PartialFunction[A, B]
+
+// return an empty set when not defined
+def applyOrElse[A, B](f: A :=> Set[B], x: A) : Set[B] =
+ Try(f(x)) getOrElse Set[B]()
+
+
+// NFAs
+case class NFA[A, C](starts: Set[A], // starting states
+ delta: (A, C) :=> Set[A], // transition function
+ fins: A => Boolean) { // final states
+
+ // given a state and a character, what is the set of
+ // next states? if there is none => empty set
+ def next(q: A, c: C) : Set[A] =
+ applyOrElse(delta, (q, c))
+
+ def nexts(qs: Set[A], c: C) : Set[A] =
+ qs.flatMap(next(_, c))
+
+ // given some states and a string, what is the set
+ // of next states?
+ def deltas(qs: Set[A], s: List[C]) : Set[A] = s match {
+ case Nil => qs
+ case c::cs => deltas(nexts(qs, c), cs)
+ }
+
+ // is a string accepted by an NFA?
+ def accepts(s: List[C]) : Boolean =
+ deltas(starts, s).exists(fins)
+}
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/progs/display/thompson1.scala Tue Sep 26 14:08:49 2017 +0100
@@ -0,0 +1,41 @@
+// Thompson Construction (Part 1)
+// (needs :load dfa.scala
+// :load nfa.scala
+// :load enfa.scala)
+
+
+// states for Thompson construction
+case class TState(i: Int) extends State
+
+object TState {
+ var counter = 0
+
+ def apply() : TState = {
+ counter += 1;
+ new TState(counter - 1)
+ }
+}
+
+
+// a type abbreviation
+type NFAt = NFA[TState, Char]
+
+
+// a NFA that does not accept any string
+def NFA_ZERO(): NFAt = {
+ val Q = TState()
+ NFA(Set(Q), { case _ => Set() }, Set())
+}
+
+// a NFA that accepts the empty string
+def NFA_ONE() : NFAt = {
+ val Q = TState()
+ NFA(Set(Q), { case _ => Set() }, Set(Q))
+}
+
+// a NFA that accepts the string "c"
+def NFA_CHAR(c: Char) : NFAt = {
+ val Q1 = TState()
+ val Q2 = TState()
+ NFA(Set(Q1), { case (Q1, d) if (c == d) => Set(Q2) }, Set(Q2))
+}
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/progs/display/thompson2.scala Tue Sep 26 14:08:49 2017 +0100
@@ -0,0 +1,43 @@
+// Thompson Construction (Part 2)
+
+// some more type abbreviations
+type NFAtrans = (TState, Char) :=> Set[TState]
+type eNFAtrans = (TState, Option[Char]) :=> Set[TState]
+
+
+// for composing an eNFA transition with a NFA transition
+implicit class RichPF(val f: eNFAtrans) extends AnyVal {
+ def +++(g: NFAtrans) : eNFAtrans =
+ { case (q, None) => applyOrElse(f, (q, None))
+ case (q, Some(c)) =>
+ applyOrElse(f, (q, Some(c))) | applyOrElse(g, (q, c)) }
+}
+
+// sequence of two NFAs
+def NFA_SEQ(enfa1: NFAt, enfa2: NFAt) : NFAt = {
+ val new_delta : eNFAtrans =
+ { case (q, None) if enfa1.fins(q) => enfa2.starts }
+
+ eNFA(enfa1.starts, new_delta +++ enfa1.delta +++ enfa2.delta,
+ enfa2.fins)
+}
+
+// alternative of two NFAs
+def NFA_ALT(enfa1: NFAt, enfa2: NFAt) : NFAt = {
+ val new_delta : NFAtrans = {
+ case (q, c) => applyOrElse(enfa1.delta, (q, c)) |
+ applyOrElse(enfa2.delta, (q, c)) }
+ val new_fins = (q: TState) => enfa1.fins(q) || enfa2.fins(q)
+
+ NFA(enfa1.starts | enfa2.starts, new_delta, new_fins)
+}
+
+// star of a NFA
+def NFA_STAR(enfa: NFAt) : NFAt = {
+ val Q = TState()
+ val new_delta : eNFAtrans =
+ { case (Q, None) => enfa.starts
+ case (q, None) if enfa.fins(q) => Set(Q) }
+
+ eNFA(Set(Q), new_delta +++ enfa.delta, Set(Q))
+}
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/progs/enfa.scala Tue Sep 26 14:08:49 2017 +0100
@@ -0,0 +1,66 @@
+// epsilon NFAs...immediately translated into NFAs
+// (needs :load dfa.scala
+// :load nfa.scala in REPL)
+
+// fixpoint construction
+import scala.annotation.tailrec
+@tailrec
+def fixpT[A](f: A => A, x: A): A = {
+ val fx = f(x)
+ if (fx == x) x else fixpT(f, fx)
+}
+
+// translates eNFAs directly into NFAs
+def eNFA[A, C](starts: Set[A], // starting states
+ delta: (A, Option[C]) :=> Set[A], // epsilon-transitions
+ fins: A => Boolean) : NFA[A, C] = { // final states
+
+ // epsilon transitions
+ def enext(q: A) : Set[A] =
+ applyOrElse(delta, (q, None))
+
+ def enexts(qs: Set[A]) : Set[A] =
+ qs | qs.flatMap(enext(_)) // | is the set-union in Scala
+
+ // epsilon closure
+ def ecl(qs: Set[A]) : Set[A] =
+ fixpT(enexts, qs)
+
+ // "normal" transitions
+ def next(q: A, c: C) : Set[A] =
+ applyOrElse(delta, (q, Some(c)))
+
+ def nexts(qs: Set[A], c: C) : Set[A] =
+ ecl(ecl(qs).flatMap(next(_, c)))
+
+ // result NFA
+ NFA(ecl(starts),
+ { case (q, c) => nexts(Set(q), c) },
+ q => ecl(Set(q)) exists fins)
+}
+
+
+// eNFA examples
+val enfa_trans1 : (State, Option[Char]) :=> Set[State] =
+ { case (Q0, Some('a')) => Set(Q0)
+ case (Q0, None) => Set(Q1, Q2)
+ case (Q1, Some('a')) => Set(Q1)
+ case (Q2, Some('b')) => Set(Q2)
+ }
+
+val enfa1 = eNFA(Set[State](Q0), enfa_trans1, Set[State](Q2))
+
+
+//
+case object R1 extends State
+case object R2 extends State
+case object R3 extends State
+
+val enfa_trans2 : (State, Option[Char]) :=> Set[State] =
+ { case (R1, Some('b')) => Set(R3)
+ case (R1, None) => Set(R2)
+ case (R2, Some('a')) => Set(R1, R3)
+ }
+
+
+val enfa2 = eNFA(Set[State](R1), enfa_trans1, Set[State](R3))
--- a/progs/nfa.scala Tue Sep 26 14:07:29 2017 +0100
+++ b/progs/nfa.scala Tue Sep 26 14:08:49 2017 +0100
@@ -1,242 +1,91 @@
-// NFAs and Thompson's construction
-
-// helper function for recording time
-def time_needed[T](i: Int, code: => T) = {
- val start = System.nanoTime()
- for (j <- 1 to i) code
- val end = System.nanoTime()
- (end - start)/(i * 1.0e9)
-}
-
-
-// state nodes
-abstract class State
-type States = Set[State]
-
-case class IntState(i: Int) extends State
-
-object NewState {
- var counter = 0
-
- def apply() : IntState = {
- counter += 1;
- new IntState(counter - 1)
- }
-}
+// NFAs in Scala using partial functions (returning
+// sets of states)
+//
+// needs :load dfa.scala for states
-case class NFA(states: States,
- start: State,
- delta: (State, Char) => States,
- eps: State => States,
- fins: States) {
-
- def epsclosure(qs: States) : States = {
- val ps = qs ++ qs.flatMap(eps(_))
- if (qs == ps) ps else epsclosure(ps)
- }
-
- def deltas(qs: States, s: List[Char]) : States = s match {
- case Nil => epsclosure(qs)
- case c::cs => deltas(epsclosure(epsclosure(qs).flatMap (delta (_, c))), cs)
- }
-
- def accepts(s: String) : Boolean =
- deltas(Set(start), s.toList) exists (fins contains (_))
-}
+// type abbreviation for partial functions
+type :=>[A, B] = PartialFunction[A, B]
-// A small example NFA from the lectures
-val Q0 = NewState()
-val Q1 = NewState()
-val Q2 = NewState()
-
-val delta : (State, Char) => States = {
- case (Q0, 'a') => Set(Q0)
- case (Q1, 'a') => Set(Q1)
- case (Q2, 'b') => Set(Q2)
- case (_, _) => Set ()
-}
-
-val eps : State => States = {
- case Q0 => Set(Q1, Q2)
- case _ => Set()
-}
-
-val NFA1 = NFA(Set(Q0, Q1, Q2), Q0, delta, eps, Set(Q2))
-
-NFA1.accepts("aa")
-NFA1.accepts("aaaaa")
-NFA1.accepts("aaaaabbb")
-NFA1.accepts("aaaaabbbaaa")
-NFA1.accepts("ac")
+// return an empty set when not defined
+def applyOrElse[A, B](f: A :=> Set[B], x: A) : Set[B] =
+ Try(f(x)) getOrElse Set[B]()
-// explicit construction of some NFAs; used in
-// Thompson's construction
-
-// NFA that does not accept any string
-def NFA_NULL() : NFA = {
- val Q = NewState()
- NFA(Set(Q), Q, { case (_, _) => Set() }, { case _ => Set() }, Set())
-}
-
-// NFA that accepts the empty string
-def NFA_EMPTY() : NFA = {
- val Q = NewState()
- NFA(Set(Q), Q, { case (_, _) => Set() }, { case _ => Set() }, Set(Q))
-}
-
-// NFA that accepts the string "c"
-def NFA_CHAR(c: Char) : NFA = {
- val Q1 = NewState()
- val Q2 = NewState()
- NFA(Set(Q1, Q2),
- Q1,
- { case (Q1, d) if (c == d) => Set(Q2)
- case (_, _) => Set() },
- { case _ => Set() },
- Set(Q2))
-}
-
-// alternative of two NFAs
-def NFA_ALT(nfa1: NFA, nfa2: NFA) : NFA = {
- val Q = NewState()
- NFA(Set(Q) ++ nfa1.states ++ nfa2.states,
- Q,
- { case (q, c) if (nfa1.states contains q) => nfa1.delta(q, c)
- case (q, c) if (nfa2.states contains q) => nfa2.delta(q, c)
- case (_, _) => Set() },
- { case Q => Set(nfa1.start, nfa2.start)
- case q if (nfa1.states contains q) => nfa1.eps(q)
- case q if (nfa2.states contains q) => nfa2.eps(q)
- case _ => Set() },
- nfa1.fins ++ nfa2.fins)
-}
-
-// sequence of two NFAs
-def NFA_SEQ(nfa1: NFA, nfa2: NFA) : NFA = {
- NFA(nfa1.states ++ nfa2.states,
- nfa1.start,
- { case (q, c) if (nfa1.states contains q) => nfa1.delta(q, c)
- case (q, c) if (nfa2.states contains q) => nfa2.delta(q, c)
- case (_, _) => Set() },
- { case q if (nfa1.fins contains q) => nfa1.eps(q) ++ Set(nfa2.start)
- case q if (nfa1.states contains q) => nfa1.eps(q)
- case q if (nfa2.states contains q) => nfa2.eps(q)
- case _ => Set() },
- nfa2.fins)
-}
-
-// star of an NFA
-def NFA_STAR(nfa: NFA) : NFA = {
- val Q = NewState()
- NFA(Set(Q) ++ nfa.states,
- Q,
- nfa.delta,
- { case Q => Set(nfa.start)
- case q if (nfa.fins contains q) => nfa.eps(q) ++ Set(Q)
- case q if (nfa.states contains q) => nfa.eps(q)
- case _ => Set() },
- Set(Q))
-}
-
-
-// regular expressions used for Thompson's construction
-abstract class Rexp
+// NFAs
+case class NFA[A, C](starts: Set[A], // starting states
+ delta: (A, C) :=> Set[A], // transition function
+ fins: A => Boolean) { // final states
-case object NULL extends Rexp
-case object EMPTY extends Rexp
-case class CHAR(c: Char) extends Rexp
-case class ALT(r1: Rexp, r2: Rexp) extends Rexp
-case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
-case class STAR(r: Rexp) extends Rexp
-
-// some convenience for typing in regular expressions
-def charlist2rexp(s : List[Char]) : Rexp = s match {
- case Nil => EMPTY
- case c::Nil => CHAR(c)
- case c::s => SEQ(CHAR(c), charlist2rexp(s))
-}
-implicit def string2rexp(s : String) : Rexp = charlist2rexp(s.toList)
-
-
-
-def thompson (r: Rexp) : NFA = r match {
- case NULL => NFA_NULL()
- case EMPTY => NFA_EMPTY()
- case CHAR(c) => NFA_CHAR(c)
- case ALT(r1, r2) => NFA_ALT(thompson(r1), thompson(r2))
- case SEQ(r1, r2) => NFA_SEQ(thompson(r1), thompson(r2))
- case STAR(r1) => NFA_STAR(thompson(r1))
-}
-
-// some examples for Thompson's
-val A = thompson(CHAR('a'))
-
-println(A.accepts("a"))
-println(A.accepts("c"))
-println(A.accepts("aa"))
-
-val B = thompson(ALT("ab","ac"))
-
-println(B.accepts("ab"))
-println(B.accepts("ac"))
-println(B.accepts("bb"))
-println(B.accepts("aa"))
+ // given a state and a character, what is the set of
+ // next states? if there is none => empty set
+ def next(q: A, c: C) : Set[A] =
+ applyOrElse(delta, (q, c))
-val C = thompson(STAR("ab"))
-
-println(C.accepts(""))
-println(C.accepts("a"))
-println(C.accepts("ababab"))
-println(C.accepts("ab"))
-println(C.accepts("ac"))
-println(C.accepts("bb"))
-println(C.accepts("aa"))
-
-// regular expression matcher using Thompson's
-def matcher(r: Rexp, s: String) : Boolean = thompson(r).accepts(s)
-
-
-//optional
-def OPT(r: Rexp) = ALT(r, EMPTY)
+ def nexts(qs: Set[A], c: C) : Set[A] =
+ qs.flatMap(next(_, c))
-//n-times
-def NTIMES(r: Rexp, n: Int) : Rexp = n match {
- case 0 => EMPTY
- case 1 => r
- case n => SEQ(r, NTIMES(r, n - 1))
-}
-
-// evil regular exproession
-def EVIL(n: Int) = SEQ(NTIMES(OPT("a"), n), NTIMES("a", n))
-
-// test harness for the matcher
-for (i <- 0 to 100 by 5) {
- println(i + ": " + "%.5f".format(time_needed(1, matcher(EVIL(i), "a" * i))))
-}
-
-
-// regular expression matching via search and backtracking
-def accepts2(nfa: NFA, s: String) : Boolean = {
-
- def search(q: State, s: List[Char]) : Boolean = s match {
- case Nil => nfa.fins contains (q)
- case c::cs =>
- (nfa.delta(q, c) exists (search(_, cs))) ||
- (nfa.eps(q) exists (search(_, c::cs)))
+ // given some states and a string, what is the set
+ // of next states?
+ def deltas(qs: Set[A], s: List[C]) : Set[A] = s match {
+ case Nil => qs
+ case c::cs => deltas(nexts(qs, c), cs)
}
- search(nfa.start, s.toList)
-}
+ // is a string accepted by an NFA?
+ def accepts(s: List[C]) : Boolean =
+ deltas(starts, s).exists(fins)
-def matcher2(r: Rexp, s: String) : Boolean = accepts2(thompson(r), s)
+ // depth-first version of accepts
+ def search(q: A, s: List[C]) : Boolean = s match {
+ case Nil => fins(q)
+ case c::cs => next(q, c).exists(search(_, cs))
+ }
-// test harness for the backtracking matcher
-for (i <- 0 to 20 by 1) {
- println(i + ": " + "%.5f".format(time_needed(1, matcher2(EVIL(i), "a" * i))))
+ def accepts2(s: List[C]) : Boolean =
+ starts.exists(search(_, s))
}
+// NFA examples
+val nfa_trans1 : (State, Char) :=> Set[State] =
+ { case (Q0, 'a') => Set(Q0, Q1)
+ case (Q0, 'b') => Set(Q2)
+ case (Q1, 'a') => Set(Q1)
+ case (Q2, 'b') => Set(Q2) }
+
+val nfa1 = NFA(Set[State](Q0), nfa_trans1, Set[State](Q2))
+
+nfa1.accepts("aa".toList) // false
+nfa1.accepts("aaaaa".toList) // false
+nfa1.accepts("aaaaab".toList) // true
+nfa1.accepts("aaaaabbb".toList) // true
+nfa1.accepts("aaaaabbbaaa".toList) // false
+nfa1.accepts("ac".toList) // false
+
+nfa1.accepts2("aa".toList) // false
+nfa1.accepts2("aaaaa".toList) // false
+nfa1.accepts2("aaaaab".toList) // true
+nfa1.accepts2("aaaaabbb".toList) // true
+nfa1.accepts2("aaaaabbbaaa".toList) // false
+nfa1.accepts2("ac".toList) // false
+
+
+
+
+// subset constructions
+
+def subset[A, C](nfa: NFA[A, C]) : DFA[Set[A], C] = {
+ DFA(nfa.starts,
+ { case (qs, c) => nfa.nexts(qs, c) },
+ _.exists(nfa.fins))
+}
+
+subset(nfa1).accepts("aa".toList) // false
+subset(nfa1).accepts("aaaaa".toList) // false
+subset(nfa1).accepts("aaaaab".toList) // true
+subset(nfa1).accepts("aaaaabbb".toList) // true
+subset(nfa1).accepts("aaaaabbbaaa".toList) // false
+subset(nfa1).accepts("ac".toList) // false
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/progs/nfa2.scala Tue Sep 26 14:08:49 2017 +0100
@@ -0,0 +1,242 @@
+// NFAs and Thompson's construction
+
+// helper function for recording time
+def time_needed[T](i: Int, code: => T) = {
+ val start = System.nanoTime()
+ for (j <- 1 to i) code
+ val end = System.nanoTime()
+ (end - start)/(i * 1.0e9)
+}
+
+
+// state nodes
+abstract class State
+type States = Set[State]
+
+case class IntState(i: Int) extends State
+
+object NewState {
+ var counter = 0
+
+ def apply() : IntState = {
+ counter += 1;
+ new IntState(counter - 1)
+ }
+}
+
+
+case class NFA(states: States,
+ start: State,
+ delta: (State, Char) => States,
+ eps: State => States,
+ fins: States) {
+
+ def epsclosure(qs: States) : States = {
+ val ps = qs ++ qs.flatMap(eps(_))
+ if (qs == ps) ps else epsclosure(ps)
+ }
+
+ def deltas(qs: States, s: List[Char]) : States = s match {
+ case Nil => epsclosure(qs)
+ case c::cs => deltas(epsclosure(epsclosure(qs).flatMap (delta (_, c))), cs)
+ }
+
+ def accepts(s: String) : Boolean =
+ deltas(Set(start), s.toList) exists (fins contains (_))
+}
+
+// A small example NFA from the lectures
+val Q0 = NewState()
+val Q1 = NewState()
+val Q2 = NewState()
+
+val delta : (State, Char) => States = {
+ case (Q0, 'a') => Set(Q0)
+ case (Q1, 'a') => Set(Q1)
+ case (Q2, 'b') => Set(Q2)
+ case (_, _) => Set ()
+}
+
+val eps : State => States = {
+ case Q0 => Set(Q1, Q2)
+ case _ => Set()
+}
+
+val NFA1 = NFA(Set(Q0, Q1, Q2), Q0, delta, eps, Set(Q2))
+
+NFA1.accepts("aa")
+NFA1.accepts("aaaaa")
+NFA1.accepts("aaaaabbb")
+NFA1.accepts("aaaaabbbaaa")
+NFA1.accepts("ac")
+
+
+// explicit construction of some NFAs; used in
+// Thompson's construction
+
+// NFA that does not accept any string
+def NFA_NULL() : NFA = {
+ val Q = NewState()
+ NFA(Set(Q), Q, { case (_, _) => Set() }, { case _ => Set() }, Set())
+}
+
+// NFA that accepts the empty string
+def NFA_EMPTY() : NFA = {
+ val Q = NewState()
+ NFA(Set(Q), Q, { case (_, _) => Set() }, { case _ => Set() }, Set(Q))
+}
+
+// NFA that accepts the string "c"
+def NFA_CHAR(c: Char) : NFA = {
+ val Q1 = NewState()
+ val Q2 = NewState()
+ NFA(Set(Q1, Q2),
+ Q1,
+ { case (Q1, d) if (c == d) => Set(Q2)
+ case (_, _) => Set() },
+ { case _ => Set() },
+ Set(Q2))
+}
+
+// alternative of two NFAs
+def NFA_ALT(nfa1: NFA, nfa2: NFA) : NFA = {
+ val Q = NewState()
+ NFA(Set(Q) ++ nfa1.states ++ nfa2.states,
+ Q,
+ { case (q, c) if (nfa1.states contains q) => nfa1.delta(q, c)
+ case (q, c) if (nfa2.states contains q) => nfa2.delta(q, c)
+ case (_, _) => Set() },
+ { case Q => Set(nfa1.start, nfa2.start)
+ case q if (nfa1.states contains q) => nfa1.eps(q)
+ case q if (nfa2.states contains q) => nfa2.eps(q)
+ case _ => Set() },
+ nfa1.fins ++ nfa2.fins)
+}
+
+// sequence of two NFAs
+def NFA_SEQ(nfa1: NFA, nfa2: NFA) : NFA = {
+ NFA(nfa1.states ++ nfa2.states,
+ nfa1.start,
+ { case (q, c) if (nfa1.states contains q) => nfa1.delta(q, c)
+ case (q, c) if (nfa2.states contains q) => nfa2.delta(q, c)
+ case (_, _) => Set() },
+ { case q if (nfa1.fins contains q) => nfa1.eps(q) ++ Set(nfa2.start)
+ case q if (nfa1.states contains q) => nfa1.eps(q)
+ case q if (nfa2.states contains q) => nfa2.eps(q)
+ case _ => Set() },
+ nfa2.fins)
+}
+
+// star of an NFA
+def NFA_STAR(nfa: NFA) : NFA = {
+ val Q = NewState()
+ NFA(Set(Q) ++ nfa.states,
+ Q,
+ nfa.delta,
+ { case Q => Set(nfa.start)
+ case q if (nfa.fins contains q) => nfa.eps(q) ++ Set(Q)
+ case q if (nfa.states contains q) => nfa.eps(q)
+ case _ => Set() },
+ Set(Q))
+}
+
+
+// regular expressions used for Thompson's construction
+abstract class Rexp
+
+case object NULL extends Rexp
+case object EMPTY extends Rexp
+case class CHAR(c: Char) extends Rexp
+case class ALT(r1: Rexp, r2: Rexp) extends Rexp
+case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
+case class STAR(r: Rexp) extends Rexp
+
+// some convenience for typing in regular expressions
+def charlist2rexp(s : List[Char]) : Rexp = s match {
+ case Nil => EMPTY
+ case c::Nil => CHAR(c)
+ case c::s => SEQ(CHAR(c), charlist2rexp(s))
+}
+implicit def string2rexp(s : String) : Rexp = charlist2rexp(s.toList)
+
+
+
+def thompson (r: Rexp) : NFA = r match {
+ case NULL => NFA_NULL()
+ case EMPTY => NFA_EMPTY()
+ case CHAR(c) => NFA_CHAR(c)
+ case ALT(r1, r2) => NFA_ALT(thompson(r1), thompson(r2))
+ case SEQ(r1, r2) => NFA_SEQ(thompson(r1), thompson(r2))
+ case STAR(r1) => NFA_STAR(thompson(r1))
+}
+
+// some examples for Thompson's
+val A = thompson(CHAR('a'))
+
+println(A.accepts("a")) // true
+println(A.accepts("c")) // false
+println(A.accepts("aa")) // false
+
+val B = thompson(ALT("ab","ac"))
+
+println(B.accepts("ab")) // true
+println(B.accepts("ac")) // true
+println(B.accepts("bb")) // false
+println(B.accepts("aa")) // false
+
+val C = thompson(STAR("ab"))
+
+println(C.accepts("")) // true
+println(C.accepts("a")) // false
+println(C.accepts("ababab")) // true
+println(C.accepts("ab")) // true
+println(C.accepts("ac")) // false
+println(C.accepts("bb")) // false
+println(C.accepts("aa")) // false
+
+// regular expression matcher using Thompson's
+def matcher(r: Rexp, s: String) : Boolean = thompson(r).accepts(s)
+
+
+//optional
+def OPT(r: Rexp) = ALT(r, EMPTY)
+
+//n-times
+def NTIMES(r: Rexp, n: Int) : Rexp = n match {
+ case 0 => EMPTY
+ case 1 => r
+ case n => SEQ(r, NTIMES(r, n - 1))
+}
+
+// evil regular exproession
+def EVIL(n: Int) = SEQ(NTIMES(OPT("a"), n), NTIMES("a", n))
+
+// test harness for the matcher
+for (i <- 0 to 100 by 5) {
+ println(i + ": " + "%.5f".format(time_needed(1, matcher(EVIL(i), "a" * i))))
+}
+
+
+// regular expression matching via search and backtracking
+def accepts2(nfa: NFA, s: String) : Boolean = {
+
+ def search(q: State, s: List[Char]) : Boolean = s match {
+ case Nil => nfa.fins contains (q)
+ case c::cs =>
+ (nfa.delta(q, c) exists (search(_, cs))) ||
+ (nfa.eps(q) exists (search(_, c::cs)))
+ }
+
+ search(nfa.start, s.toList)
+}
+
+def matcher2(r: Rexp, s: String) : Boolean = accepts2(thompson(r), s)
+
+// test harness for the backtracking matcher
+for (i <- 0 to 20 by 1) {
+ println(i + ": " + "%.5f".format(time_needed(1, matcher2(EVIL(i), "a" * i))))
+}
+
+
+
+
--- a/progs/re0.scala Tue Sep 26 14:07:29 2017 +0100
+++ b/progs/re0.scala Tue Sep 26 14:08:49 2017 +0100
@@ -1,4 +1,5 @@
import scala.annotation.tailrec
+import scala.language.implicitConversions
abstract class Rexp
@@ -16,7 +17,7 @@
// some convenience for typing in regular expressions
implicit def string2rexp(s : String) : Rexp = STR(s)
-implicit def RexpOps(r: Rexp) = new {
+implicit def RexpOps(r: Rexp) : Rexp = new {
def | (s: Rexp) = ALT(r, s)
def % = STAR(r)
def %(n: Int) = REP(r, n)
@@ -26,7 +27,7 @@
def ~ (s: Rexp) = SEQ(r, s)
}
-implicit def stringOps(s: String) = new {
+implicit def stringOps(s: String) : Rexp = new {
def | (r: Rexp) = ALT(s, r)
def | (r: String) = ALT(s, r)
def % = STAR(s)
@@ -72,6 +73,7 @@
if (i == 0) NULL else SEQ(der(c, r), REP(r, i - 1))
}
+
// derivative w.r.t. a string (iterates der)
@tailrec
def ders (s: List[Char], r: Rexp) : Rexp = s match {
--- a/progs/re1.scala Tue Sep 26 14:07:29 2017 +0100
+++ b/progs/re1.scala Tue Sep 26 14:08:49 2017 +0100
@@ -1,3 +1,6 @@
+// Simple matcher for basic regular expressions
+
+object Test {
abstract class Rexp
case object ZERO extends Rexp // matches nothing
@@ -19,6 +22,8 @@
}
+}
+
// derivative of a regular expression w.r.t. a character
def der (c: Char, r: Rexp) : Rexp = r match {
case ZERO => ZERO
@@ -40,22 +45,27 @@
// main matcher function
def matches(r: Rexp, s: String) : Boolean = nullable(ders(s.toList, r))
+
//examples from the homework
+
val r = STAR(ALT(SEQ(CHAR('a'), CHAR('b')), CHAR('b')))
der('a', r)
der('b', r)
der('c', r)
-//optional (one or zero times)
+//optional regular expression (one or zero times)
def OPT(r: Rexp) = ALT(r, ONE)
-//n-times (explicitly expanded)
+//n-times regular expression (explicitly expanded)
def NTIMES(r: Rexp, n: Int) : Rexp = n match {
case 0 => ONE
case 1 => r
case n => SEQ(r, NTIMES(r, n - 1))
}
+
+// Test Cases
+
// the evil regular expression a?{n} a{n}
def EVIL1(n: Int) = SEQ(NTIMES(OPT(CHAR('a')), n), NTIMES(CHAR('a'), n))
@@ -70,6 +80,7 @@
(end - start)/(i * 1.0e9)
}
+
//test: (a?{n}) (a{n})
for (i <- 1 to 20) {
println(i + ": " + "%.5f".format(time_needed(2, matches(EVIL1(i), "a" * i))))
@@ -88,3 +99,43 @@
println(i + " " + "%.5f".format(time_needed(2, matches(EVIL2, "a" * i))))
}
+
+
+
+// size of a regular expressions - for testing purposes
+def size(r: Rexp) : Int = r match {
+ case ZERO => 1
+ case ONE => 1
+ case CHAR(_) => 1
+ case ALT(r1, r2) => 1 + size(r1) + size(r2)
+ case SEQ(r1, r2) => 1 + size(r1) + size(r2)
+ case STAR(r) => 1 + size(r)
+}
+
+// the expicit expansion in EVIL1(n) increases
+// drastically its size
+
+size(EVIL1(1)) // 5
+size(EVIL1(3)) // 17
+size(EVIL1(5)) // 29
+size(EVIL1(7)) // 41
+
+
+// given a regular expression and building successive
+// derivatives might result into bigger and bigger
+// regular expressions...here is an example for this:
+
+val BIG_aux = STAR(ALT(CHAR('a'), CHAR('b')))
+val BIG = SEQ(BIG_aux, SEQ(CHAR('a'),SEQ(CHAR('b'), BIG_aux)))
+
+size(ders("".toList, BIG)) // 13
+size(ders("ab".toList, BIG)) // 51
+size(ders("abab".toList, BIG)) // 112
+size(ders("ababab".toList, BIG)) // 191
+size(ders("abababab".toList, BIG)) // 288
+size(ders("ababababab".toList, BIG)) // 403
+size(ders("abababababab".toList, BIG)) // 536
+
+
+size(ders(("ab" * 200).toList, BIG)) // 366808
+
--- a/progs/re2.scala Tue Sep 26 14:07:29 2017 +0100
+++ b/progs/re2.scala Tue Sep 26 14:08:49 2017 +0100
@@ -1,5 +1,6 @@
-// version with explicit an n-times regular expression
-// this keeps the regular expression small
+// Version with explicit an explicit n-times regular expression;
+// this keeps the overall regular expression in the EVIL1 regular
+// expression small
abstract class Rexp
case object ZERO extends Rexp
@@ -8,7 +9,8 @@
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp
-case class NTIMES(r: Rexp, n: Int) extends Rexp //explicit constructor
+case class NTIMES(r: Rexp, n: Int) extends Rexp //explicit constructor for n-times
+
def nullable (r: Rexp) : Boolean = r match {
case ZERO => false
@@ -20,6 +22,7 @@
case NTIMES(r, i) => if (i == 0) true else nullable(r)
}
+
def der (c: Char, r: Rexp) : Rexp = r match {
case ZERO => ZERO
case ONE => ZERO
@@ -30,7 +33,7 @@
else SEQ(der(c, r1), r2)
case STAR(r1) => SEQ(der(c, r1), STAR(r1))
case NTIMES(r1, i) =>
- if (i == 0) ZERO else der(c, SEQ(r1, NTIMES(r1, i - 1)))
+ if (i == 0) ZERO else SEQ(der(c, r1), NTIMES(r1, i - 1))
}
def ders (s: List[Char], r: Rexp) : Rexp = s match {
@@ -41,9 +44,12 @@
def matches(r: Rexp, s: String) : Boolean = nullable(ders(s.toList, r))
-//optional: one or zero times
+//optional regular expression: one or zero times
def OPT(r: Rexp) = ALT(r, ONE)
+
+// Test Cases
+
//evil regular expressions
def EVIL1(n: Int) = SEQ(NTIMES(OPT(CHAR('a')), n), NTIMES(CHAR('a'), n))
val EVIL2 = SEQ(STAR(STAR(CHAR('a'))), CHAR('b'))
@@ -57,11 +63,11 @@
//test: (a?{n}) (a{n})
-for (i <- 1 to 1001 by 10) {
+for (i <- 1 to 1201 by 100) {
println(i + " " + "%.5f".format(time_needed(2, matches(EVIL1(i), "a" * i))))
}
-for (i <- 1 to 1001 by 10) {
+for (i <- 1 to 1201 by 100) {
println(i + " " + "%.5f".format(time_needed(2, matches(EVIL1(i), "a" * i))))
}
@@ -76,3 +82,32 @@
}
+
+// size of a regular expressions - for testing purposes
+def size(r: Rexp) : Int = r match {
+ case ZERO => 1
+ case ONE => 1
+ case CHAR(_) => 1
+ case ALT(r1, r2) => 1 + size(r1) + size(r2)
+ case SEQ(r1, r2) => 1 + size(r1) + size(r2)
+ case STAR(r) => 1 + size(r)
+ case NTIMES(r, _) => 1 + size(r)
+}
+
+// EVIL1(n) has now a constant size, no matter
+// what n is
+
+size(EVIL1(1)) // 7
+size(EVIL1(3)) // 7
+size(EVIL1(5)) // 7
+size(EVIL1(7)) // 7
+
+
+// but the size of the derivatives can still grow
+// quite dramatically
+
+size(ders("".toList, EVIL2)) // 5
+size(ders("a".toList, EVIL2)) // 12
+size(ders("aa".toList, EVIL2)) // 28
+size(ders("aaa".toList, EVIL2)) // 58
+size(ders("aaaa".toList, EVIL2)) // 116
--- a/progs/re3.scala Tue Sep 26 14:07:29 2017 +0100
+++ b/progs/re3.scala Tue Sep 26 14:08:49 2017 +0100
@@ -1,3 +1,6 @@
+// Version with simplification during derivatives;
+// this keeps the regular expressions small, which
+// is good for run-time
abstract class Rexp
case object ZERO extends Rexp
@@ -32,7 +35,7 @@
else SEQ(der(c, r1), r2)
case STAR(r1) => SEQ(der(c, r1), STAR(r1))
case NTIMES(r, i) =>
- if (i == 0) ZERO else der(c, SEQ(r, NTIMES(r, i - 1)))
+ if (i == 0) ZERO else SEQ(der(c, r), NTIMES(r, i - 1))
}
def simp(r: Rexp) : Rexp = r match {
@@ -63,12 +66,12 @@
def matcher(r: Rexp, s: String) : Boolean = nullable(ders(s.toList, r))
-var regex = NTIMES(CHAR('a'),5)
-println(matcher(regex,"aaaaa"))
-
//one or zero
def OPT(r: Rexp) = ALT(r, ONE)
+
+// Test Cases
+
//evil regular expressions
def EVIL1(n: Int) = SEQ(NTIMES(OPT(CHAR('a')), n), NTIMES(CHAR('a'), n))
val EVIL2 = SEQ(STAR(STAR(CHAR('a'))), CHAR('b'))
@@ -83,20 +86,43 @@
//test: (a?{n}) (a{n})
-for (i <- 1 to 9001 by 10) {
+for (i <- 1 to 11001 by 1000) {
println(i + " " + "%.5f".format(time_needed(2, matcher(EVIL1(i), "a" * i))))
}
-for (i <- 1 to 9001 by 1000) {
+for (i <- 1 to 11001 by 1000) {
println(i + " " + "%.5f".format(time_needed(2, matcher(EVIL1(i), "a" * i))))
}
//test: (a*)* b
-for (i <- 1 to 7000001 by 500000) {
+for (i <- 1 to 6000001 by 500000) {
+ println(i + " " + "%.5f".format(time_needed(2, matcher(EVIL2, "a" * i))))
+}
+
+for (i <- 1 to 6000001 by 500000) {
println(i + " " + "%.5f".format(time_needed(2, matcher(EVIL2, "a" * i))))
}
-for (i <- 1 to 7000001 by 500000) {
- println(i + " " + "%.5f".format(time_needed(2, matcher(EVIL2, "a" * i))))
+
+
+// size of a regular expressions - for testing purposes
+def size(r: Rexp) : Int = r match {
+ case ZERO => 1
+ case ONE => 1
+ case CHAR(_) => 1
+ case ALT(r1, r2) => 1 + size(r1) + size(r2)
+ case SEQ(r1, r2) => 1 + size(r1) + size(r2)
+ case STAR(r) => 1 + size(r)
+ case NTIMES(r, _) => 1 + size(r)
}
+
+// now the size of the derivatives grows
+// much, much slower
+
+size(ders("".toList, EVIL2)) // 5
+size(ders("a".toList, EVIL2)) // 8
+size(ders("aa".toList, EVIL2)) // 8
+size(ders("aaa".toList, EVIL2)) // 8
+size(ders("aaaa".toList, EVIL2)) // 8
+size(ders("aaaaa".toList, EVIL2)) // 8
--- a/progs/re4.scala Tue Sep 26 14:07:29 2017 +0100
+++ b/progs/re4.scala Tue Sep 26 14:08:49 2017 +0100
@@ -1,4 +1,6 @@
-
+// Version which attempts to move whole strings, not
+// just characters, under derivatives whenever possible
+
abstract class Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
@@ -50,6 +52,7 @@
case r => r
}
+// *new*
// derivative w.r.t. a string (iterates der)
def ders2(s: List[Char], r: Rexp) : Rexp = (s, r) match {
case (Nil, r) => r
@@ -68,6 +71,9 @@
//one or zero
def OPT(r: Rexp) = ALT(r, ONE)
+
+// Test Cases
+
def EVIL1(n: Int) = SEQ(NTIMES(OPT(CHAR('a')), n), NTIMES(CHAR('a'), n))
val EVIL2 = SEQ(STAR(STAR(CHAR('a'))), CHAR('b'))
@@ -79,7 +85,6 @@
(end - start)/(i * 1.0e9)
}
-// warmup
//test: (a?{n}) (a{n})
for (i <- 1 to 7000001 by 500000) {
println(i + " " + "%.5f".format(time_needed(2, matcher(EVIL1(i), "a" * i))))
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/progs/thompson.scala Tue Sep 26 14:08:49 2017 +0100
@@ -0,0 +1,177 @@
+// Thompson Construction
+// (needs :load dfa.scala
+// :load nfa.scala
+// :load enfa.scala)
+
+
+// states for Thompson construction
+case class TState(i: Int) extends State
+
+object TState {
+ var counter = 0
+
+ def apply() : TState = {
+ counter += 1;
+ new TState(counter - 1)
+ }
+}
+
+
+// some types abbreviations
+type NFAt = NFA[TState, Char]
+type NFAtrans = (TState, Char) :=> Set[TState]
+type eNFAtrans = (TState, Option[Char]) :=> Set[TState]
+
+
+// for composing an eNFA transition with a NFA transition
+implicit class RichPF(val f: eNFAtrans) extends AnyVal {
+ def +++(g: NFAtrans) : eNFAtrans =
+ { case (q, None) => applyOrElse(f, (q, None))
+ case (q, Some(c)) => applyOrElse(f, (q, Some(c))) | applyOrElse(g, (q, c)) }
+}
+
+
+// NFA that does not accept any string
+def NFA_ZERO(): NFAt = {
+ val Q = TState()
+ NFA(Set(Q), { case _ => Set() }, Set())
+}
+
+// NFA that accepts the empty string
+def NFA_ONE() : NFAt = {
+ val Q = TState()
+ NFA(Set(Q), { case _ => Set() }, Set(Q))
+}
+
+// NFA that accepts the string "c"
+def NFA_CHAR(c: Char) : NFAt = {
+ val Q1 = TState()
+ val Q2 = TState()
+ NFA(Set(Q1), { case (Q1, d) if (c == d) => Set(Q2) }, Set(Q2))
+}
+
+// sequence of two NFAs
+def NFA_SEQ(enfa1: NFAt, enfa2: NFAt) : NFAt = {
+ val new_delta : eNFAtrans =
+ { case (q, None) if enfa1.fins(q) => enfa2.starts }
+
+ eNFA(enfa1.starts, new_delta +++ enfa1.delta +++ enfa2.delta,
+ enfa2.fins)
+}
+
+// alternative of two NFAs
+def NFA_ALT(enfa1: NFAt, enfa2: NFAt) : NFAt = {
+ val new_delta : NFAtrans = {
+ case (q, c) => applyOrElse(enfa1.delta, (q, c)) |
+ applyOrElse(enfa2.delta, (q, c)) }
+ val new_fins = (q: TState) => enfa1.fins(q) || enfa2.fins(q)
+
+ NFA(enfa1.starts | enfa2.starts, new_delta, new_fins)
+}
+
+// star of a NFA
+def NFA_STAR(enfa: NFAt) : NFAt = {
+ val Q = TState()
+ val new_delta : eNFAtrans =
+ { case (Q, None) => enfa.starts
+ case (q, None) if enfa.fins(q) => Set(Q) }
+
+ eNFA(Set(Q), new_delta +++ enfa.delta, Set(Q))
+}
+
+
+
+// regular expressions
+abstract class Rexp
+case object ZERO extends Rexp // matches nothing
+case object ONE extends Rexp // matches the empty string
+case class CHAR(c: Char) extends Rexp // matches a character c
+case class ALT(r1: Rexp, r2: Rexp) extends Rexp // alternative
+case class SEQ(r1: Rexp, r2: Rexp) extends Rexp // sequence
+case class STAR(r: Rexp) extends Rexp // star
+
+
+
+
+// thompson construction
+def thompson (r: Rexp) : NFAt = r match {
+ case ZERO => NFA_ZERO()
+ case ONE => NFA_ONE()
+ case CHAR(c) => NFA_CHAR(c)
+ case ALT(r1, r2) => NFA_ALT(thompson(r1), thompson(r2))
+ case SEQ(r1, r2) => NFA_SEQ(thompson(r1), thompson(r2))
+ case STAR(r1) => NFA_STAR(thompson(r1))
+}
+
+//optional regular expression (one or zero times)
+def OPT(r: Rexp) = ALT(r, ONE)
+
+//n-times regular expression (explicitly expanded)
+def NTIMES(r: Rexp, n: Int) : Rexp = n match {
+ case 0 => ONE
+ case 1 => r
+ case n => SEQ(r, NTIMES(r, n - 1))
+}
+
+
+def tmatches(r: Rexp, s: String) : Boolean =
+ thompson(r).accepts(s.toList)
+
+def tmatches2(r: Rexp, s: String) : Boolean =
+ thompson(r).accepts2(s.toList)
+
+// dfa via subset construction
+def tmatches_dfa(r: Rexp, s: String) : Boolean =
+ subset(thompson(r)).accepts(s.toList)
+
+// Test Cases
+
+
+// the evil regular expression a?{n} a{n}
+def EVIL1(n: Int) : Rexp = SEQ(NTIMES(OPT(CHAR('a')), n), NTIMES(CHAR('a'), n))
+
+// the evil regular expression (a*)*b
+val EVIL2 : Rexp = SEQ(STAR(STAR(CHAR('a'))), CHAR('b'))
+
+//for measuring time
+def time_needed[T](i: Int, code: => T) = {
+ val start = System.nanoTime()
+ for (j <- 1 to i) code
+ val end = System.nanoTime()
+ (end - start)/(i * 1.0e9)
+}
+
+// the size of the NFA can be large,
+// thus slowing down the breadth-first search
+
+for (i <- 1 to 13) {
+ println(i + ": " + "%.5f".format(time_needed(2, tmatches(EVIL1(i), "a" * i))))
+}
+
+for (i <- 1 to 100 by 5) {
+ println(i + " " + "%.5f".format(time_needed(2, tmatches(EVIL2, "a" * i))))
+}
+
+
+// the backtracking needed in depth-first search
+// can be painfully slow
+
+for (i <- 1 to 8) {
+ println(i + " " + "%.5f".format(time_needed(2, tmatches2(EVIL2, "a" * i))))
+}
+
+
+
+// while my thompson-enfa-subset-partial-function-chain
+// is probably not the most effcient way to obtain a fast DFA
+// (the below should be much faster with a more direct construction),
+// in general the DFAs can be slow because of the state explosion
+// in the subset construction
+
+for (i <- 1 to 13) {
+ println(i + ": " + "%.5f".format(time_needed(2, tmatches_dfa(EVIL1(i), "a" * i))))
+}
+
+for (i <- 1 to 100 by 5) {
+ println(i + " " + "%.5f".format(time_needed(2, tmatches_dfa(EVIL2, "a" * i))))
+}
Binary file slides/slides01.pdf has changed
--- a/slides/slides01.tex Tue Sep 26 14:07:29 2017 +0100
+++ b/slides/slides01.tex Tue Sep 26 14:08:49 2017 +0100
@@ -42,7 +42,7 @@
\begin{center}
\begin{tabular}{ll}
Email: & christian.urban at kcl.ac.uk\\
- Office: & S1.27 (1st floor Strand Building)\\
+ Office: & N7.07 (North Wing, Bush House)\\
Slides: & KEATS
\end{tabular}
\end{center}
--- a/slides/slides02.tex Tue Sep 26 14:07:29 2017 +0100
+++ b/slides/slides02.tex Tue Sep 26 14:08:49 2017 +0100
@@ -35,7 +35,7 @@
\begin{center}
\begin{tabular}{ll}
Email: & christian.urban at kcl.ac.uk\\
- Office: & S1.27 (1st floor Strand Building)\\
+ Office: & N7.07 (North Wing, Bush House)\\
Slides: & KEATS
\end{tabular}
\end{center}
--- a/slides/slides03.tex Tue Sep 26 14:07:29 2017 +0100
+++ b/slides/slides03.tex Tue Sep 26 14:08:49 2017 +0100
@@ -28,7 +28,7 @@
\begin{center}
\begin{tabular}{lp{8cm}}
Email: & christian.urban at kcl.ac.uk\\
- Office: & S1.27 (1st floor Strand Building)\\
+ Office: & N7.07 (North Wing, Bush House)\\
Slides: & KEATS (also home work and coursework is there)\\
\end{tabular}
\end{center}
--- a/slides/slides04.tex Tue Sep 26 14:07:29 2017 +0100
+++ b/slides/slides04.tex Tue Sep 26 14:08:49 2017 +0100
@@ -27,7 +27,7 @@
\begin{center}
\begin{tabular}{ll}
Email: & christian.urban at kcl.ac.uk\\
- Office: & S1.27 (1st floor Strand Building)\\
+ Office: & N7.07 (North Wing, Bush House)\\
Slides: & KEATS (also home work is there)\\
\end{tabular}
\end{center}
--- a/slides/slides05.tex Tue Sep 26 14:07:29 2017 +0100
+++ b/slides/slides05.tex Tue Sep 26 14:08:49 2017 +0100
@@ -30,7 +30,7 @@
\begin{center}
\begin{tabular}{ll}
Email: & christian.urban at kcl.ac.uk\\
- Office: & S1.27 (1st floor Strand Building)\\
+ Office: & N7.07 (North Wing, Bush House)\\
Slides: & KEATS (also home work is there)\\
\end{tabular}
\end{center}
--- a/slides/slides06.tex Tue Sep 26 14:07:29 2017 +0100
+++ b/slides/slides06.tex Tue Sep 26 14:08:49 2017 +0100
@@ -28,7 +28,7 @@
\begin{center}
\begin{tabular}{ll}
Email: & christian.urban at kcl.ac.uk\\
- Office: & S1.27 (1st floor Strand Building)\\
+ Office: & N7.07 (North Wing, Bush House)\\
Slides: & KEATS (also home work is there)\\
\end{tabular}
\end{center}
--- a/slides/slides07.tex Tue Sep 26 14:07:29 2017 +0100
+++ b/slides/slides07.tex Tue Sep 26 14:08:49 2017 +0100
@@ -25,7 +25,7 @@
\begin{center}
\begin{tabular}{ll}
Email: & christian.urban at kcl.ac.uk\\
- Office: & S1.27 (1st floor Strand Building)\\
+ Office: & N7.07 (North Wing, Bush House)\\
Slides: & KEATS (also home work is there)\\
\end{tabular}
\end{center}
--- a/slides/slides08.tex Tue Sep 26 14:07:29 2017 +0100
+++ b/slides/slides08.tex Tue Sep 26 14:08:49 2017 +0100
@@ -23,7 +23,7 @@
\begin{center}
\begin{tabular}{ll}
Email: & christian.urban at kcl.ac.uk\\
- Office: & S1.27 (1st floor Strand Building)\\
+ Office: & N7.07 (North Wing, Bush House)\\
Slides: & KEATS (also home work is there)\\
\end{tabular}
\end{center}
--- a/slides/slides09.tex Tue Sep 26 14:07:29 2017 +0100
+++ b/slides/slides09.tex Tue Sep 26 14:08:49 2017 +0100
@@ -53,7 +53,7 @@
\begin{center}
\begin{tabular}{ll}
Email: & christian.urban at kcl.ac.uk\\
- Office: & S1.27 (1st floor Strand Building)\\
+ Office: & N7.07 (North Wing, Bush House)\\
Slides: & KEATS (also home work is there)\\
\end{tabular}
\end{center}
--- a/slides/slides10.tex Tue Sep 26 14:07:29 2017 +0100
+++ b/slides/slides10.tex Tue Sep 26 14:08:49 2017 +0100
@@ -51,7 +51,7 @@
\begin{center}
\begin{tabular}{ll}
Email: & christian.urban at kcl.ac.uk\\
- Office: & S1.27 (1st floor Strand Building)\\
+ Office: & N7.07 (North Wing, Bush House)\\
Slides: & KEATS (also home work is there)\\
\end{tabular}
\end{center}
--- a/slides/slides11.tex Tue Sep 26 14:07:29 2017 +0100
+++ b/slides/slides11.tex Tue Sep 26 14:08:49 2017 +0100
@@ -49,7 +49,7 @@
\begin{center}
\begin{tabular}{ll}
Email: & christian.urban at kcl.ac.uk\\
- Office: & S1.27 (1st floor Strand Building)\\
+ Office: & N7.07 (North Wing, Bush House)\\
Slides: & KEATS (also home work is there)\\
\end{tabular}
\end{center}
--- a/slides/slides12.tex Tue Sep 26 14:07:29 2017 +0100
+++ b/slides/slides12.tex Tue Sep 26 14:08:49 2017 +0100
@@ -25,7 +25,7 @@
\begin{center}
\begin{tabular}{ll}
Email: & christian.urban at kcl.ac.uk\\
- Office: & S1.27 (1st floor Strand Building)\\
+ Office: & N7.07 (North Wing, Bush House)\\
Slides: & KEATS (also home work is there)\\
\end{tabular}
\end{center}