cws/cw03.tex
changeset 153 4383809c176a
parent 152 114a89518aea
child 154 39c6b93718f0
equal deleted inserted replaced
152:114a89518aea 153:4383809c176a
     1 \documentclass{article}
     1 \documentclass{article}
     2 \usepackage{../style}
     2 \usepackage{../style}
     3 \usepackage{../langs}
     3 \usepackage{../langs}
       
     4 \usepackage{tikz}
       
     5 \usepackage{pgf}
       
     6 \usepackage{pgfplots}
       
     7 
       
     8 \begin{filecontents}{re-python2.data}
       
     9 1 0.033
       
    10 5 0.036
       
    11 10 0.034
       
    12 15 0.036
       
    13 18 0.059
       
    14 19 0.084
       
    15 20 0.141
       
    16 21 0.248
       
    17 22 0.485
       
    18 23 0.878
       
    19 24 1.71
       
    20 25 3.40
       
    21 26 7.08
       
    22 27 14.12
       
    23 28 26.69
       
    24 \end{filecontents}
       
    25 
       
    26 \begin{filecontents}{re-java.data}
       
    27 5  0.00298
       
    28 10  0.00418
       
    29 15  0.00996
       
    30 16  0.01710
       
    31 17  0.03492
       
    32 18  0.03303
       
    33 19  0.05084
       
    34 20  0.10177
       
    35 21  0.19960
       
    36 22  0.41159
       
    37 23  0.82234
       
    38 24  1.70251
       
    39 25  3.36112
       
    40 26  6.63998
       
    41 27  13.35120
       
    42 28  29.81185
       
    43 \end{filecontents}
       
    44 
     4 
    45 
     5 \begin{document}
    46 \begin{document}
     6 
    47 
     7 \section*{Coursework 8 (Scala, Regular Expressions)}
    48 \section*{Coursework 8 (Scala, Regular Expressions, Brainf***)}
     8 
    49 
     9 This coursework is worth 10\%. It is about regular expressions,
    50 This coursework is worth 10\%. It is about regular expressions,
    10 pattern matching and polymorphism. The first part is due on 30
    51 pattern matching and an interpreter. The first part is due on 30
    11 November at 11pm; the second, more advanced part, is due on 21
    52 November at 11pm; the second, more advanced part, is due on 21
    12 December at 11pm. You are asked to implement a regular expression
    53 December at 11pm. In the first part, you are asked to implement a
    13 matcher based on derivatives of regular expressions. The reason is
    54 regular expression matcher based on derivatives of regular
    14 that regular expression matching in Java can be extremely slow
    55 expressions. The reason is that regular expression matching in Java
    15 sometimes.\bigskip
    56 can sometimes be extremely slow. The advanced part is about an
       
    57 interpreter for a very simple programming language.\bigskip
    16 
    58 
    17 \noindent
    59 \noindent
    18 \textbf{Important:}
    60 \textbf{Important:}
    19 
    61 
    20 \begin{itemize}
    62 \begin{itemize}
    52 
    94 
    53 
    95 
    54 \subsection*{Part 1 (6 Marks)}
    96 \subsection*{Part 1 (6 Marks)}
    55 
    97 
    56 The task is to implement a regular expression matcher that is based on
    98 The task is to implement a regular expression matcher that is based on
    57 derivatives of regular expressions. The implementation can deal
    99 derivatives of regular expressions. Most of the functions are defined by
    58 with the following regular expressions, which have been predefined
   100 recursion over regular expressions and can be elegantly implemented
    59 in the file re.scala:
   101 using Scala's pattern-matching. The implementation should deal with the
       
   102 following regular expressions, which have been predefined in the file
       
   103 \texttt{re.scala}:
    60 
   104 
    61 \begin{center}
   105 \begin{center}
    62 \begin{tabular}{lcll}
   106 \begin{tabular}{lcll}
    63   $r$ & $::=$ & $\ZERO$     & cannot match anything\\
   107   $r$ & $::=$ & $\ZERO$     & cannot match anything\\
    64       &   $|$ & $\ONE$      & can only match the empty string\\
   108       &   $|$ & $\ONE$      & can only match the empty string\\
    89 \subsubsection*{Tasks (file re.scala)}
   133 \subsubsection*{Tasks (file re.scala)}
    90 
   134 
    91 \begin{itemize}
   135 \begin{itemize}
    92 \item[(1a)] Implement a function, called \textit{nullable}, by
   136 \item[(1a)] Implement a function, called \textit{nullable}, by
    93   recursion over regular expressions. This function tests whether a
   137   recursion over regular expressions. This function tests whether a
    94   regular expression can match the empty string, that is given a
   138   regular expression can match the empty string. This means given a
    95   regular expression it either returns true or false.
   139   regular expression it either returns true or false.
    96 
   140 
    97 \begin{center}
   141 \begin{center}
    98 \begin{tabular}{lcl}
   142 \begin{tabular}{lcl}
    99 $\textit{nullable}(\ZERO)$ & $\dn$ & $\textit{false}$\\
   143 $\textit{nullable}(\ZERO)$ & $\dn$ & $\textit{false}$\\
   161 Note, the last derivative can match the empty string, that is it is \textit{nullable}.\\
   205 Note, the last derivative can match the empty string, that is it is \textit{nullable}.\\
   162 \mbox{}\hfill\mbox{[1 Mark]}
   206 \mbox{}\hfill\mbox{[1 Mark]}
   163 
   207 
   164 \item[(1c)] Implement the function \textit{simp}, which recursively
   208 \item[(1c)] Implement the function \textit{simp}, which recursively
   165   traverses a regular expression from the inside to the outside, and
   209   traverses a regular expression from the inside to the outside, and
   166   simplifies every sub-regular-expression on the left (see below) to
   210   on the way simplifies every regular expression on the left (see
   167   the regular expression on the right, except it does not simplify inside
   211   below) to the regular expression on the right, except it does not
   168   ${}^*$-regular expressions.
   212   simplify inside ${}^*$-regular expressions.
   169 
   213 
   170   \begin{center}
   214   \begin{center}
   171 \begin{tabular}{l@{\hspace{4mm}}c@{\hspace{4mm}}ll}
   215 \begin{tabular}{l@{\hspace{4mm}}c@{\hspace{4mm}}ll}
   172 $r \cdot \ZERO$ & $\mapsto$ & $\ZERO$\\ 
   216 $r \cdot \ZERO$ & $\mapsto$ & $\ZERO$\\ 
   173 $\ZERO \cdot r$ & $\mapsto$ & $\ZERO$\\ 
   217 $\ZERO \cdot r$ & $\mapsto$ & $\ZERO$\\ 
   180   \end{center}
   224   \end{center}
   181 
   225 
   182   For example the regular expression
   226   For example the regular expression
   183   \[(r_1 + \ZERO) \cdot \ONE + ((\ONE + r_2) + r_3) \cdot (r_4 \cdot \ZERO)\]
   227   \[(r_1 + \ZERO) \cdot \ONE + ((\ONE + r_2) + r_3) \cdot (r_4 \cdot \ZERO)\]
   184 
   228 
   185   simplifies to just $r_1$. \textbf{Hints:} Regular expressions can be
   229   simplifies to just $r_1$. \textbf{Hint:} Regular expressions can be
   186   seen as trees and there are several methods for traversing
   230   seen as trees and there are several methods for traversing
   187   trees. One of them corresponds to the inside-out traversal. Also
   231   trees. One of them corresponds to the inside-out traversal, which is
   188   remember numerical expressions from school: there you had expressions
   232   sometimes also called post-order traversal. Furthermore,
       
   233   remember numerical expressions from school times: there you had expressions
   189   like $u + \ldots + (1 \cdot x) - \ldots (z + (y \cdot 0)) \ldots$
   234   like $u + \ldots + (1 \cdot x) - \ldots (z + (y \cdot 0)) \ldots$
   190   and simplification rules that looked very similar to rules
   235   and simplification rules that looked very similar to rules
   191   above. You would simplify such numerical expressions by replacing
   236   above. You would simplify such numerical expressions by replacing
   192   for example the $y \cdot 0$ by $0$, or $1\cdot x$ by $x$, and then
   237   for example the $y \cdot 0$ by $0$, or $1\cdot x$ by $x$, and then
   193   look whether more rules are applicable. If you organise the
   238   look whether more rules are applicable. If you organise the
   194   simplification in an inside-out fashion, it is always clear which
   239   simplification in an inside-out fashion, it is always clear which
   195   rule should applied next.\\\mbox{}\hfill[1 Mark]
   240   rule should be applied next.\hfill[2 Marks]
   196 
   241 
   197 \item[(1d)] Implement two functions: The first, called \textit{ders},
   242 \item[(1d)] Implement two functions: The first, called \textit{ders},
   198   takes a list of characters and a regular expression as arguments, and
   243   takes a list of characters and a regular expression as arguments, and
   199   builds the derivative w.r.t.~the list as follows:
   244   builds the derivative w.r.t.~the list as follows:
   200 
   245 
   212 The second function, called \textit{matcher}, takes a string and a
   257 The second function, called \textit{matcher}, takes a string and a
   213 regular expression as arguments. It builds first the derivatives
   258 regular expression as arguments. It builds first the derivatives
   214 according to \textit{ders} and after that tests whether the resulting
   259 according to \textit{ders} and after that tests whether the resulting
   215 derivative regular expression can match the empty string (using
   260 derivative regular expression can match the empty string (using
   216 \textit{nullable}).  For example the \textit{matcher} will produce
   261 \textit{nullable}).  For example the \textit{matcher} will produce
   217 true given the regular expression $(a\cdot b)\cdot c$ and the string
   262 true for the regular expression $(a\cdot b)\cdot c$ and the string
   218 $abc$.\\  \mbox{}\hfill[1 Mark]
   263 $abc$, but false if you give it the string $ab$. \hfill[1 Mark]
   219 
   264 
   220 \item[(1e)] Implement the function $\textit{replace}\;r\;s_1\;s_2$: it searches
   265 \item[(1e)] Implement a function, called \textit{size}, by recursion
   221   (from the left to 
       
   222 right) in the string $s_1$ all the non-empty substrings that match the 
       
   223 regular expression $r$---these substrings are assumed to be
       
   224 the longest substrings matched by the regular expression and
       
   225 assumed to be non-overlapping. All these substrings in $s_1$ matched by $r$
       
   226 are replaced by $s_2$. For example given the regular expression
       
   227 
       
   228 \[(a \cdot a)^* + (b \cdot b)\]
       
   229 
       
   230 \noindent the string $s_1 = aabbbaaaaaaabaaaaabbaaaabb$ and
       
   231 replacement the string $s_2 = c$ yields the string
       
   232 
       
   233 \[
       
   234 ccbcabcaccc
       
   235 \]
       
   236 
       
   237 \hfill[2 Marks]
       
   238 \end{itemize}
       
   239 
       
   240 
       
   241 
       
   242 
       
   243 \subsection*{Part 2 (4 Marks)}
       
   244 
       
   245 You need to copy all the code from \texttt{re.scala} into
       
   246 \texttt{re2.scala} in order to complete Part 2.  Parts (2a) and (2b)
       
   247 give you another method and datapoints for testing the \textit{der}
       
   248 and \textit{simp} functions from Part~1.
       
   249 
       
   250 \subsubsection*{Tasks (file re2.scala)}
       
   251 
       
   252 \begin{itemize}
       
   253 \item[(2a)] Write a \textbf{polymorphic} function, called
       
   254   \textit{iterT}, that is \textbf{tail-recursive}(!) and takes an
       
   255   integer $n$, a function $f$ and an $x$ as arguments. This function
       
   256   should iterate $f$ $n$-times starting with the argument $x$, like
       
   257 
       
   258   \[\underbrace{f(\ldots (f}_{n\text{-times}}(x)))
       
   259   \]
       
   260 
       
   261   More formally that means \textit{iterT} behaves as follows:
       
   262 
       
   263   \begin{center}
       
   264     \begin{tabular}{lcl}
       
   265       $\textit{iterT}(n, f, x)$ & $\dn$ &
       
   266       $\begin{cases}
       
   267         \;x & \textit{if}\;n = 0\\
       
   268         \;f(\textit{iterT}(n - 1, f, x)) & \textit{otherwise}
       
   269         \end{cases}$      
       
   270     \end{tabular}
       
   271 \end{center}
       
   272 
       
   273   Make sure you write a \textbf{tail-recursive} version of
       
   274   \textit{iterT}.  If you add the annotation \texttt{@tailrec} (see
       
   275   below) your code should not produce an error message.
       
   276 
       
   277   \begin{lstlisting}[language=Scala, numbers=none, xleftmargin=-1mm]  
       
   278   import scala.annotation.tailrec
       
   279 
       
   280   @tailrec
       
   281   def iterT[A](n: Int, f: A => A, x: A): A = ...
       
   282   \end{lstlisting}
       
   283 
       
   284   You can assume that \textit{iterT} will only be called for positive
       
   285   integers $0 \le n$. Given the type variable \texttt{A}, the type of
       
   286   $f$ is \texttt{A => A} and the type of $x$ is \texttt{A}. This means
       
   287   \textit{iterT} can be used, for example, for functions from integers
       
   288   to integers, or strings to strings, or regular expressions to
       
   289   regular expressions.  \\ \mbox{}\hfill[2 Marks]
       
   290 
       
   291 \item[(2b)] Implement a function, called \textit{size}, by recursion
       
   292   over regular expressions. If a regular expression is seen as a tree,
   266   over regular expressions. If a regular expression is seen as a tree,
   293   then \textit{size} should return the number of nodes in such a
   267   then \textit{size} should return the number of nodes in such a
   294   tree. Therefore this function is defined as follows:
   268   tree. Therefore this function is defined as follows:
   295 
   269 
   296 \begin{center}
   270 \begin{center}
   302 $\textit{size}(r_1 \cdot r_2)$ & $\dn$ & $1 + \textit{size}(r_1) + \textit{size}(r_2)$\\
   276 $\textit{size}(r_1 \cdot r_2)$ & $\dn$ & $1 + \textit{size}(r_1) + \textit{size}(r_2)$\\
   303 $\textit{size}(r^*)$ & $\dn$ & $1 + \textit{size}(r)$\\
   277 $\textit{size}(r^*)$ & $\dn$ & $1 + \textit{size}(r)$\\
   304 \end{tabular}
   278 \end{tabular}
   305 \end{center}
   279 \end{center}
   306 
   280 
   307 You can use \textit{size} and \textit{iterT} in order to test how much
   281 You can use \textit{size} in order to test how much the `evil' regular
   308 the 'evil' regular expression $(a^*)^* \cdot b$ grows when taking
   282 expression $(a^*)^* \cdot b$ grows when taking successive derivatives
   309 successive derivatives according the letter $a$ and then compare it to
   283 according the letter $a$ without simplification and then compare it to
   310 taking the derivative, but simlifying the derivative after each step.
   284 taking the derivative, but simplify the result.  The sizes
   311 For example, the calls
   285 are given in \texttt{re.scala}. \hfill[1 Mark]
   312 
   286 \end{itemize}
   313   \begin{lstlisting}[language=Scala, numbers=none, xleftmargin=-1mm]  
       
   314   size(iterT(20, (r: Rexp) => der('a', r), EVIL))
       
   315   size(iterT(20, (r: Rexp) => simp(der('a', r)), EVIL))
       
   316   \end{lstlisting}
       
   317 
       
   318   produce without simplification a regular expression of size of
       
   319   7340068 after 20 iterations, while the one with
       
   320   simplification gives
       
   321   just 8.\\ \mbox{}\hfill[1 Mark]
       
   322   
       
   323 
       
   324 \item[(2c)] Write a \textbf{polymorphic} function, called
       
   325   \textit{fixpT}, that takes
       
   326   a function $f$ and an $x$ as arguments. The purpose
       
   327   of \textit{fixpT} is to calculate a fixpoint of the function $f$
       
   328   starting from the argument $x$.
       
   329   A fixpoint, say $y$, is when $f(y) = y$ holds. 
       
   330   That means \textit{fixpT} behaves as follows:
       
   331 
       
   332   \begin{center}
       
   333     \begin{tabular}{lcl}
       
   334       $\textit{fixpT}(f, x)$ & $\dn$ &
       
   335       $\begin{cases}
       
   336         \;x & \textit{if}\;f(x) = x\\
       
   337         \;\textit{fixpT}(f, f(x)) & \textit{otherwise}
       
   338         \end{cases}$      
       
   339     \end{tabular}
       
   340 \end{center}
       
   341 
       
   342   Make sure you calculate in the code of $\textit{fixpT}$ the result
       
   343   of $f(x)$ only once. Given the type variable \texttt{A} in
       
   344   $\textit{fixpT}$, the type of $f$ is \texttt{A => A} and the type of
       
   345   $x$ is \texttt{A}. The file \texttt{re2.scala} gives two example
       
   346   function where in one the fixpoint is 1 and in the other
       
   347   it is the string $a$.\\ \mbox{}\hfill[1 Mark]  
       
   348 \end{itemize}\bigskip  
       
   349 
       
   350 
   287 
   351 \subsection*{Background}
   288 \subsection*{Background}
   352 
   289 
   353 Although easily implementable in Scala, the idea behind the derivative
   290 Although easily implementable in Scala, the idea behind the derivative
   354 function might not so easy to be seen. To understand its purpose
   291 function might not so easy to be seen. To understand its purpose
   355 better, assume a regular expression $r$ can match strings of the form
   292 better, assume a regular expression $r$ can match strings of the form
   356 $c\!::\!cs$ (that means strings which start with a character $c$ and have
   293 $c\!::\!cs$ (that means strings which start with a character $c$ and have
   357 some rest, or tail, $cs$). If you now take the derivative of $r$ with
   294 some rest, or tail, $cs$). If you take the derivative of $r$ with
   358 respect to the character $c$, then you obtain a regular expressions
   295 respect to the character $c$, then you obtain a regular expression
   359 that can match all the strings $cs$.  In other words, the regular
   296 that can match all the strings $cs$.  In other words, the regular
   360 expression $\textit{der}\;c\;r$ can match the same strings $c::cs$
   297 expression $\textit{der}\;c\;r$ can match the same strings $c\!::\!cs$
   361 that can be matched by $r$, except that the $c$ is chopped off.
   298 that can be matched by $r$, except that the $c$ is chopped off.
   362 
   299 
   363 Assume now $r$ can match the string $abc$. If you take the derivative
   300 Assume now $r$ can match the string $abc$. If you take the derivative
   364 according to $a$ then you obtain a regular expression that can match
   301 according to $a$ then you obtain a regular expression that can match
   365 $bc$ (it is $abc$ where the $a$ has been chopped off). If you now
   302 $bc$ (it is $abc$ where the $a$ has been chopped off). If you now
   366 build the derivative $\textit{der}\;b\;(\textit{der}\;a\;r))$ you
   303 build the derivative $\textit{der}\;b\;(\textit{der}\;a\;r)$ you
   367 obtain a regular expression that can match the string $c$ (it is $bc$
   304 obtain a regular expression that can match the string $c$ (it is $bc$
   368 where $b$ is chopped off). If you finally build the derivative of this
   305 where $b$ is chopped off). If you finally build the derivative of this
   369 according $c$, that is
   306 according $c$, that is
   370 $\textit{der}\;c\;(\textit{der}\;b\;(\textit{der}\;a\;r)))$, you
   307 $\textit{der}\;c\;(\textit{der}\;b\;(\textit{der}\;a\;r))$, you obtain
   371 obtain a regular expression that can match the empty string. You can
   308 a regular expression that can match the empty string. You can test
   372 test this using the function nullable, which is what your matcher is
   309 whether this is indeed the case using the function nullable, which is
   373 doing.
   310 what your matcher is doing.
   374 
   311 
   375 The purpose of the simp function is to keep the regular expression
   312 The purpose of the $\textit{simp}$ function is to keep the regular
   376 small. Normally the derivative function makes the regular expression
   313 expression small. Normally the derivative function makes the regular
   377 bigger (see the SEQ case and the example in (2b)) and the algorithm
   314 expression bigger (see the SEQ case and the example in (1b)) and the
   378 would be slower and slower over time. The simp function counters this
   315 algorithm would be slower and slower over time. The $\textit{simp}$
   379 increase in size and the result is that the algorithm is fast
   316 function counters this increase in size and the result is that the
   380 throughout.  By the way, this algorithm is by Janusz Brzozowski who
   317 algorithm is fast throughout.  By the way, this algorithm is by Janusz
   381 came up with the idea of derivatives in 1964 in his PhD thesis.
   318 Brzozowski who came up with the idea of derivatives in 1964 in his PhD
       
   319 thesis.
   382 
   320 
   383 \begin{center}\small
   321 \begin{center}\small
   384 \url{https://en.wikipedia.org/wiki/Janusz_Brzozowski_(computer_scientist)}
   322 \url{https://en.wikipedia.org/wiki/Janusz_Brzozowski_(computer_scientist)}
   385 \end{center}
   323 \end{center}
       
   324 
       
   325 
       
   326 If you want to see how badly the regular expression matchers do in
       
   327 Java and Python with the `evil' regular expression, then have a look
       
   328 at the graphs below (you can try it out for yourself: have a look at
       
   329 the file \texttt{catastrophic.java} on KEATS). Compare this with the
       
   330 matcher you have implemented. How long can the string of $a$'s be
       
   331 in your matcher and stay within the 30 seconds time limit?
       
   332 
       
   333 \begin{center}
       
   334 \begin{tikzpicture}
       
   335 \begin{axis}[
       
   336     title={Graph: $(a^*)^*\cdot b$ and strings 
       
   337            $\underbrace{a\ldots a}_{n}$},
       
   338     xlabel={$n$},
       
   339     x label style={at={(1.05,0.0)}},
       
   340     ylabel={time in secs},
       
   341     enlargelimits=false,
       
   342     xtick={0,5,...,30},
       
   343     xmax=33,
       
   344     ymax=35,
       
   345     ytick={0,5,...,30},
       
   346     scaled ticks=false,
       
   347     axis lines=left,
       
   348     width=6cm,
       
   349     height=5.0cm, 
       
   350     legend entries={Python, Java},  
       
   351     legend pos=outer north east]
       
   352 \addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
       
   353 \addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
       
   354 \end{axis}
       
   355 \end{tikzpicture}
       
   356 \end{center}
       
   357 \newpage
       
   358 
       
   359 \subsection*{Part 2 (4 Marks)}
       
   360 
       
   361 Comming from Java or C++, you might think Scala is a quite
       
   362 esotheric programming language.  But remember, some serious companies
       
   363 have built their business on Scala. And there are far more esotheric
       
   364 languages out there. One is called \emph{brainf***}. Urban M\"uller
       
   365 developed this language in 1993.  A close relative was already
       
   366 introduced in ... by Corado B\"ohm, an Italian computer pionier, who
       
   367 unfortunately died a few months ago. One feature of brainf*** is its
       
   368 minimalistic set of instructions. It has just 8 instructions, all of
       
   369 which are single characters. Despite this minimalism, this language,
       
   370 given enough memory, has been shown to be Turing complete. In this
       
   371 part you will implement an interpreter for this language.
       
   372 
       
   373 
       
   374 
       
   375 \subsubsection*{Tasks (file bf.scala)}
       
   376 
       
   377 \begin{itemize}
       
   378 \item[(2a)] 
       
   379 
       
   380 \item[(2b)]   
       
   381 
       
   382 \item[(2c)]
       
   383 
       
   384 \end{itemize}\bigskip  
       
   385 
       
   386 
       
   387 
   386 
   388 
   387 \end{document}
   389 \end{document}
   388 
   390 
   389 
   391 
   390 %%% Local Variables: 
   392 %%% Local Variables: