diff -r 114a89518aea -r 4383809c176a cws/cw03.tex --- a/cws/cw03.tex Tue Nov 21 16:31:11 2017 +0000 +++ b/cws/cw03.tex Thu Nov 23 10:56:47 2017 +0000 @@ -1,18 +1,60 @@ \documentclass{article} \usepackage{../style} \usepackage{../langs} +\usepackage{tikz} +\usepackage{pgf} +\usepackage{pgfplots} + +\begin{filecontents}{re-python2.data} +1 0.033 +5 0.036 +10 0.034 +15 0.036 +18 0.059 +19 0.084 +20 0.141 +21 0.248 +22 0.485 +23 0.878 +24 1.71 +25 3.40 +26 7.08 +27 14.12 +28 26.69 +\end{filecontents} + +\begin{filecontents}{re-java.data} +5 0.00298 +10 0.00418 +15 0.00996 +16 0.01710 +17 0.03492 +18 0.03303 +19 0.05084 +20 0.10177 +21 0.19960 +22 0.41159 +23 0.82234 +24 1.70251 +25 3.36112 +26 6.63998 +27 13.35120 +28 29.81185 +\end{filecontents} + \begin{document} -\section*{Coursework 8 (Scala, Regular Expressions)} +\section*{Coursework 8 (Scala, Regular Expressions, Brainf***)} This coursework is worth 10\%. It is about regular expressions, -pattern matching and polymorphism. The first part is due on 30 +pattern matching and an interpreter. The first part is due on 30 November at 11pm; the second, more advanced part, is due on 21 -December at 11pm. You are asked to implement a regular expression -matcher based on derivatives of regular expressions. The reason is -that regular expression matching in Java can be extremely slow -sometimes.\bigskip +December at 11pm. In the first part, you are asked to implement a +regular expression matcher based on derivatives of regular +expressions. The reason is that regular expression matching in Java +can sometimes be extremely slow. The advanced part is about an +interpreter for a very simple programming language.\bigskip \noindent \textbf{Important:} @@ -54,9 +96,11 @@ \subsection*{Part 1 (6 Marks)} The task is to implement a regular expression matcher that is based on -derivatives of regular expressions. The implementation can deal -with the following regular expressions, which have been predefined -in the file re.scala: +derivatives of regular expressions. Most of the functions are defined by +recursion over regular expressions and can be elegantly implemented +using Scala's pattern-matching. The implementation should deal with the +following regular expressions, which have been predefined in the file +\texttt{re.scala}: \begin{center} \begin{tabular}{lcll} @@ -91,7 +135,7 @@ \begin{itemize} \item[(1a)] Implement a function, called \textit{nullable}, by recursion over regular expressions. This function tests whether a - regular expression can match the empty string, that is given a + regular expression can match the empty string. This means given a regular expression it either returns true or false. \begin{center} @@ -163,9 +207,9 @@ \item[(1c)] Implement the function \textit{simp}, which recursively traverses a regular expression from the inside to the outside, and - simplifies every sub-regular-expression on the left (see below) to - the regular expression on the right, except it does not simplify inside - ${}^*$-regular expressions. + on the way simplifies every regular expression on the left (see + below) to the regular expression on the right, except it does not + simplify inside ${}^*$-regular expressions. \begin{center} \begin{tabular}{l@{\hspace{4mm}}c@{\hspace{4mm}}ll} @@ -182,17 +226,18 @@ For example the regular expression \[(r_1 + \ZERO) \cdot \ONE + ((\ONE + r_2) + r_3) \cdot (r_4 \cdot \ZERO)\] - simplifies to just $r_1$. \textbf{Hints:} Regular expressions can be + simplifies to just $r_1$. \textbf{Hint:} Regular expressions can be seen as trees and there are several methods for traversing - trees. One of them corresponds to the inside-out traversal. Also - remember numerical expressions from school: there you had expressions + trees. One of them corresponds to the inside-out traversal, which is + sometimes also called post-order traversal. Furthermore, + remember numerical expressions from school times: there you had expressions like $u + \ldots + (1 \cdot x) - \ldots (z + (y \cdot 0)) \ldots$ and simplification rules that looked very similar to rules above. You would simplify such numerical expressions by replacing for example the $y \cdot 0$ by $0$, or $1\cdot x$ by $x$, and then look whether more rules are applicable. If you organise the simplification in an inside-out fashion, it is always clear which - rule should applied next.\\\mbox{}\hfill[1 Mark] + rule should be applied next.\hfill[2 Marks] \item[(1d)] Implement two functions: The first, called \textit{ders}, takes a list of characters and a regular expression as arguments, and @@ -214,81 +259,10 @@ according to \textit{ders} and after that tests whether the resulting derivative regular expression can match the empty string (using \textit{nullable}). For example the \textit{matcher} will produce -true given the regular expression $(a\cdot b)\cdot c$ and the string -$abc$.\\ \mbox{}\hfill[1 Mark] - -\item[(1e)] Implement the function $\textit{replace}\;r\;s_1\;s_2$: it searches - (from the left to -right) in the string $s_1$ all the non-empty substrings that match the -regular expression $r$---these substrings are assumed to be -the longest substrings matched by the regular expression and -assumed to be non-overlapping. All these substrings in $s_1$ matched by $r$ -are replaced by $s_2$. For example given the regular expression - -\[(a \cdot a)^* + (b \cdot b)\] - -\noindent the string $s_1 = aabbbaaaaaaabaaaaabbaaaabb$ and -replacement the string $s_2 = c$ yields the string - -\[ -ccbcabcaccc -\] - -\hfill[2 Marks] -\end{itemize} - - - - -\subsection*{Part 2 (4 Marks)} - -You need to copy all the code from \texttt{re.scala} into -\texttt{re2.scala} in order to complete Part 2. Parts (2a) and (2b) -give you another method and datapoints for testing the \textit{der} -and \textit{simp} functions from Part~1. - -\subsubsection*{Tasks (file re2.scala)} +true for the regular expression $(a\cdot b)\cdot c$ and the string +$abc$, but false if you give it the string $ab$. \hfill[1 Mark] -\begin{itemize} -\item[(2a)] Write a \textbf{polymorphic} function, called - \textit{iterT}, that is \textbf{tail-recursive}(!) and takes an - integer $n$, a function $f$ and an $x$ as arguments. This function - should iterate $f$ $n$-times starting with the argument $x$, like - - \[\underbrace{f(\ldots (f}_{n\text{-times}}(x))) - \] - - More formally that means \textit{iterT} behaves as follows: - - \begin{center} - \begin{tabular}{lcl} - $\textit{iterT}(n, f, x)$ & $\dn$ & - $\begin{cases} - \;x & \textit{if}\;n = 0\\ - \;f(\textit{iterT}(n - 1, f, x)) & \textit{otherwise} - \end{cases}$ - \end{tabular} -\end{center} - - Make sure you write a \textbf{tail-recursive} version of - \textit{iterT}. If you add the annotation \texttt{@tailrec} (see - below) your code should not produce an error message. - - \begin{lstlisting}[language=Scala, numbers=none, xleftmargin=-1mm] - import scala.annotation.tailrec - - @tailrec - def iterT[A](n: Int, f: A => A, x: A): A = ... - \end{lstlisting} - - You can assume that \textit{iterT} will only be called for positive - integers $0 \le n$. Given the type variable \texttt{A}, the type of - $f$ is \texttt{A => A} and the type of $x$ is \texttt{A}. This means - \textit{iterT} can be used, for example, for functions from integers - to integers, or strings to strings, or regular expressions to - regular expressions. \\ \mbox{}\hfill[2 Marks] - -\item[(2b)] Implement a function, called \textit{size}, by recursion +\item[(1e)] Implement a function, called \textit{size}, by recursion over regular expressions. If a regular expression is seen as a tree, then \textit{size} should return the number of nodes in such a tree. Therefore this function is defined as follows: @@ -304,49 +278,12 @@ \end{tabular} \end{center} -You can use \textit{size} and \textit{iterT} in order to test how much -the 'evil' regular expression $(a^*)^* \cdot b$ grows when taking -successive derivatives according the letter $a$ and then compare it to -taking the derivative, but simlifying the derivative after each step. -For example, the calls - - \begin{lstlisting}[language=Scala, numbers=none, xleftmargin=-1mm] - size(iterT(20, (r: Rexp) => der('a', r), EVIL)) - size(iterT(20, (r: Rexp) => simp(der('a', r)), EVIL)) - \end{lstlisting} - - produce without simplification a regular expression of size of - 7340068 after 20 iterations, while the one with - simplification gives - just 8.\\ \mbox{}\hfill[1 Mark] - - -\item[(2c)] Write a \textbf{polymorphic} function, called - \textit{fixpT}, that takes - a function $f$ and an $x$ as arguments. The purpose - of \textit{fixpT} is to calculate a fixpoint of the function $f$ - starting from the argument $x$. - A fixpoint, say $y$, is when $f(y) = y$ holds. - That means \textit{fixpT} behaves as follows: - - \begin{center} - \begin{tabular}{lcl} - $\textit{fixpT}(f, x)$ & $\dn$ & - $\begin{cases} - \;x & \textit{if}\;f(x) = x\\ - \;\textit{fixpT}(f, f(x)) & \textit{otherwise} - \end{cases}$ - \end{tabular} -\end{center} - - Make sure you calculate in the code of $\textit{fixpT}$ the result - of $f(x)$ only once. Given the type variable \texttt{A} in - $\textit{fixpT}$, the type of $f$ is \texttt{A => A} and the type of - $x$ is \texttt{A}. The file \texttt{re2.scala} gives two example - function where in one the fixpoint is 1 and in the other - it is the string $a$.\\ \mbox{}\hfill[1 Mark] -\end{itemize}\bigskip - +You can use \textit{size} in order to test how much the `evil' regular +expression $(a^*)^* \cdot b$ grows when taking successive derivatives +according the letter $a$ without simplification and then compare it to +taking the derivative, but simplify the result. The sizes +are given in \texttt{re.scala}. \hfill[1 Mark] +\end{itemize} \subsection*{Background} @@ -354,36 +291,101 @@ function might not so easy to be seen. To understand its purpose better, assume a regular expression $r$ can match strings of the form $c\!::\!cs$ (that means strings which start with a character $c$ and have -some rest, or tail, $cs$). If you now take the derivative of $r$ with -respect to the character $c$, then you obtain a regular expressions +some rest, or tail, $cs$). If you take the derivative of $r$ with +respect to the character $c$, then you obtain a regular expression that can match all the strings $cs$. In other words, the regular -expression $\textit{der}\;c\;r$ can match the same strings $c::cs$ +expression $\textit{der}\;c\;r$ can match the same strings $c\!::\!cs$ that can be matched by $r$, except that the $c$ is chopped off. Assume now $r$ can match the string $abc$. If you take the derivative according to $a$ then you obtain a regular expression that can match $bc$ (it is $abc$ where the $a$ has been chopped off). If you now -build the derivative $\textit{der}\;b\;(\textit{der}\;a\;r))$ you +build the derivative $\textit{der}\;b\;(\textit{der}\;a\;r)$ you obtain a regular expression that can match the string $c$ (it is $bc$ where $b$ is chopped off). If you finally build the derivative of this according $c$, that is -$\textit{der}\;c\;(\textit{der}\;b\;(\textit{der}\;a\;r)))$, you -obtain a regular expression that can match the empty string. You can -test this using the function nullable, which is what your matcher is -doing. +$\textit{der}\;c\;(\textit{der}\;b\;(\textit{der}\;a\;r))$, you obtain +a regular expression that can match the empty string. You can test +whether this is indeed the case using the function nullable, which is +what your matcher is doing. -The purpose of the simp function is to keep the regular expression -small. Normally the derivative function makes the regular expression -bigger (see the SEQ case and the example in (2b)) and the algorithm -would be slower and slower over time. The simp function counters this -increase in size and the result is that the algorithm is fast -throughout. By the way, this algorithm is by Janusz Brzozowski who -came up with the idea of derivatives in 1964 in his PhD thesis. +The purpose of the $\textit{simp}$ function is to keep the regular +expression small. Normally the derivative function makes the regular +expression bigger (see the SEQ case and the example in (1b)) and the +algorithm would be slower and slower over time. The $\textit{simp}$ +function counters this increase in size and the result is that the +algorithm is fast throughout. By the way, this algorithm is by Janusz +Brzozowski who came up with the idea of derivatives in 1964 in his PhD +thesis. \begin{center}\small \url{https://en.wikipedia.org/wiki/Janusz_Brzozowski_(computer_scientist)} \end{center} + +If you want to see how badly the regular expression matchers do in +Java and Python with the `evil' regular expression, then have a look +at the graphs below (you can try it out for yourself: have a look at +the file \texttt{catastrophic.java} on KEATS). Compare this with the +matcher you have implemented. How long can the string of $a$'s be +in your matcher and stay within the 30 seconds time limit? + +\begin{center} +\begin{tikzpicture} +\begin{axis}[ + title={Graph: $(a^*)^*\cdot b$ and strings + $\underbrace{a\ldots a}_{n}$}, + xlabel={$n$}, + x label style={at={(1.05,0.0)}}, + ylabel={time in secs}, + enlargelimits=false, + xtick={0,5,...,30}, + xmax=33, + ymax=35, + ytick={0,5,...,30}, + scaled ticks=false, + axis lines=left, + width=6cm, + height=5.0cm, + legend entries={Python, Java}, + legend pos=outer north east] +\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data}; +\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data}; +\end{axis} +\end{tikzpicture} +\end{center} +\newpage + +\subsection*{Part 2 (4 Marks)} + +Comming from Java or C++, you might think Scala is a quite +esotheric programming language. But remember, some serious companies +have built their business on Scala. And there are far more esotheric +languages out there. One is called \emph{brainf***}. Urban M\"uller +developed this language in 1993. A close relative was already +introduced in ... by Corado B\"ohm, an Italian computer pionier, who +unfortunately died a few months ago. One feature of brainf*** is its +minimalistic set of instructions. It has just 8 instructions, all of +which are single characters. Despite this minimalism, this language, +given enough memory, has been shown to be Turing complete. In this +part you will implement an interpreter for this language. + + + +\subsubsection*{Tasks (file bf.scala)} + +\begin{itemize} +\item[(2a)] + +\item[(2b)] + +\item[(2c)] + +\end{itemize}\bigskip + + + + \end{document}