cws/cw03.tex
author Christian Urban <urbanc@in.tum.de>
Thu, 23 Nov 2017 10:56:47 +0000
changeset 153 4383809c176a
parent 152 114a89518aea
child 154 39c6b93718f0
permissions -rw-r--r--
updated

\documentclass{article}
\usepackage{../style}
\usepackage{../langs}
\usepackage{tikz}
\usepackage{pgf}
\usepackage{pgfplots}

\begin{filecontents}{re-python2.data}
1 0.033
5 0.036
10 0.034
15 0.036
18 0.059
19 0.084
20 0.141
21 0.248
22 0.485
23 0.878
24 1.71
25 3.40
26 7.08
27 14.12
28 26.69
\end{filecontents}

\begin{filecontents}{re-java.data}
5  0.00298
10  0.00418
15  0.00996
16  0.01710
17  0.03492
18  0.03303
19  0.05084
20  0.10177
21  0.19960
22  0.41159
23  0.82234
24  1.70251
25  3.36112
26  6.63998
27  13.35120
28  29.81185
\end{filecontents}


\begin{document}

\section*{Coursework 8 (Scala, Regular Expressions, Brainf***)}

This coursework is worth 10\%. It is about regular expressions,
pattern matching and an interpreter. The first part is due on 30
November at 11pm; the second, more advanced part, is due on 21
December at 11pm. In the first part, you are asked to implement a
regular expression matcher based on derivatives of regular
expressions. The reason is that regular expression matching in Java
can sometimes be extremely slow. The advanced part is about an
interpreter for a very simple programming language.\bigskip

\noindent
\textbf{Important:}

\begin{itemize}
\item Make sure the files you submit can be processed by just calling\\
  \mbox{\texttt{scala <<filename.scala>>}} on the commandline. Use the
  template files provided and do not make any changes to arguments of
  functions or to any types. You are free to implement any auxiliary
  function you might need.

\item Do not use any mutable data structures in your
submissions! They are not needed. This means you cannot create new 
\texttt{Array}s or \texttt{ListBuffer}s, for example. 

\item Do not use \texttt{return} in your code! It has a different
  meaning in Scala, than in Java.

\item Do not use \texttt{var}! This declares a mutable variable. Only
  use \texttt{val}!

\item Do not use any parallel collections! No \texttt{.par} therefore!
  Our testing and marking infrastructure is not set up for it.
\end{itemize}

\noindent
Also note that the running time of each part will be restricted to a
maximum of 360 seconds on my laptop


\subsection*{Disclaimer}

It should be understood that the work you submit represents
your own effort! You have not copied from anyone else. An
exception is the Scala code I showed during the lectures or
uploaded to KEATS, which you can freely use.\bigskip


\subsection*{Part 1 (6 Marks)}

The task is to implement a regular expression matcher that is based on
derivatives of regular expressions. Most of the functions are defined by
recursion over regular expressions and can be elegantly implemented
using Scala's pattern-matching. The implementation should deal with the
following regular expressions, which have been predefined in the file
\texttt{re.scala}:

\begin{center}
\begin{tabular}{lcll}
  $r$ & $::=$ & $\ZERO$     & cannot match anything\\
      &   $|$ & $\ONE$      & can only match the empty string\\
      &   $|$ & $c$         & can match a character (in this case $c$)\\
      &   $|$ & $r_1 + r_2$ & can match a string either with $r_1$ or with $r_2$\\
  &   $|$ & $r_1\cdot r_2$ & can match the first part of a string with $r_1$ and\\
          &  & & then the second part with $r_2$\\
      &   $|$ & $r^*$       & can match zero or more times $r$\\
\end{tabular}
\end{center}

\noindent 
Why? Knowing how to match regular expressions and strings will let you
solve a lot of problems that vex other humans. Regular expressions are
one of the fastest and simplest ways to match patterns in text, and
are endlessly useful for searching, editing and analysing data in all
sorts of places (for example analysing network traffic in order to
detect security breaches). However, you need to be fast, otherwise you
will stumble over problems such as recently reported at

{\small
\begin{itemize}
\item[$\bullet$] \url{http://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}
\item[$\bullet$] \url{https://vimeo.com/112065252}
\item[$\bullet$] \url{http://davidvgalbraith.com/how-i-fixed-atom/}  
\end{itemize}}

\subsubsection*{Tasks (file re.scala)}

\begin{itemize}
\item[(1a)] Implement a function, called \textit{nullable}, by
  recursion over regular expressions. This function tests whether a
  regular expression can match the empty string. This means given a
  regular expression it either returns true or false.

\begin{center}
\begin{tabular}{lcl}
$\textit{nullable}(\ZERO)$ & $\dn$ & $\textit{false}$\\
$\textit{nullable}(\ONE)$  & $\dn$ & $\textit{true}$\\
$\textit{nullable}(c)$     & $\dn$ & $\textit{false}$\\
$\textit{nullable}(r_1 + r_2)$ & $\dn$ & $\textit{nullable}(r_1) \vee \textit{nullable}(r_2)$\\
$\textit{nullable}(r_1 \cdot r_2)$ & $\dn$ & $\textit{nullable}(r_1) \wedge \textit{nullable}(r_2)$\\
$\textit{nullable}(r^*)$ & $\dn$ & $\textit{true}$\\
\end{tabular}
\end{center}\hfill[1 Mark]

\item[(1b)] Implement a function, called \textit{der}, by recursion over
  regular expressions. It takes a character and a regular expression
  as arguments and calculates the derivative regular expression according
  to the rules:

\begin{center}
\begin{tabular}{lcl}
$\textit{der}\;c\;(\ZERO)$ & $\dn$ & $\ZERO$\\
$\textit{der}\;c\;(\ONE)$  & $\dn$ & $\ZERO$\\
$\textit{der}\;c\;(d)$     & $\dn$ & $\textit{if}\; c = d\;\textit{then} \;\ONE \; \textit{else} \;\ZERO$\\
$\textit{der}\;c\;(r_1 + r_2)$ & $\dn$ & $(\textit{der}\;c\;r_1) + (\textit{der}\;c\;r_2)$\\
$\textit{der}\;c\;(r_1 \cdot r_2)$ & $\dn$ & $\textit{if}\;\textit{nullable}(r_1)$\\
      & & $\textit{then}\;((\textit{der}\;c\;r_1)\cdot r_2) + (\textit{der}\;c\;r_2)$\\
      & & $\textit{else}\;(\textit{der}\;c\;r_1)\cdot r_2$\\
$\textit{der}\;c\;(r^*)$ & $\dn$ & $(\textit{der}\;c\;r)\cdot (r^*)$\\
\end{tabular}
\end{center}

For example given the regular expression $r = (a \cdot b) \cdot c$, the derivatives
w.r.t.~the characters $a$, $b$ and $c$ are

\begin{center}
  \begin{tabular}{lcll}
    $\textit{der}\;a\;r$ & $=$ & $(\ONE \cdot b)\cdot c$ & ($= r'$)\\
    $\textit{der}\;b\;r$ & $=$ & $(\ZERO \cdot b)\cdot c$\\
    $\textit{der}\;c\;r$ & $=$ & $(\ZERO \cdot b)\cdot c$
  \end{tabular}
\end{center}

Let $r'$ stand for the first derivative, then taking the derivatives of $r'$
w.r.t.~the characters $a$, $b$ and $c$ gives

\begin{center}
  \begin{tabular}{lcll}
    $\textit{der}\;a\;r'$ & $=$ & $((\ZERO \cdot b) + \ZERO)\cdot c$ \\
    $\textit{der}\;b\;r'$ & $=$ & $((\ZERO \cdot b) + \ONE)\cdot c$ & ($= r''$)\\
    $\textit{der}\;c\;r'$ & $=$ & $((\ZERO \cdot b) + \ZERO)\cdot c$
  \end{tabular}
\end{center}

One more example: Let $r''$ stand for the second derivative above,
then taking the derivatives of $r''$ w.r.t.~the characters $a$, $b$
and $c$ gives

\begin{center}
  \begin{tabular}{lcll}
    $\textit{der}\;a\;r''$ & $=$ & $((\ZERO \cdot b) + \ZERO) \cdot c + \ZERO$ \\
    $\textit{der}\;b\;r''$ & $=$ & $((\ZERO \cdot b) + \ZERO) \cdot c + \ZERO$\\
    $\textit{der}\;c\;r''$ & $=$ & $((\ZERO \cdot b) + \ZERO) \cdot c + \ONE$ &
    (is $\textit{nullable}$)                      
  \end{tabular}
\end{center}

Note, the last derivative can match the empty string, that is it is \textit{nullable}.\\
\mbox{}\hfill\mbox{[1 Mark]}

\item[(1c)] Implement the function \textit{simp}, which recursively
  traverses a regular expression from the inside to the outside, and
  on the way simplifies every regular expression on the left (see
  below) to the regular expression on the right, except it does not
  simplify inside ${}^*$-regular expressions.

  \begin{center}
\begin{tabular}{l@{\hspace{4mm}}c@{\hspace{4mm}}ll}
$r \cdot \ZERO$ & $\mapsto$ & $\ZERO$\\ 
$\ZERO \cdot r$ & $\mapsto$ & $\ZERO$\\ 
$r \cdot \ONE$ & $\mapsto$ & $r$\\ 
$\ONE \cdot r$ & $\mapsto$ & $r$\\ 
$r + \ZERO$ & $\mapsto$ & $r$\\ 
$\ZERO + r$ & $\mapsto$ & $r$\\ 
$r + r$ & $\mapsto$ & $r$\\ 
\end{tabular}
  \end{center}

  For example the regular expression
  \[(r_1 + \ZERO) \cdot \ONE + ((\ONE + r_2) + r_3) \cdot (r_4 \cdot \ZERO)\]

  simplifies to just $r_1$. \textbf{Hint:} Regular expressions can be
  seen as trees and there are several methods for traversing
  trees. One of them corresponds to the inside-out traversal, which is
  sometimes also called post-order traversal. Furthermore,
  remember numerical expressions from school times: there you had expressions
  like $u + \ldots + (1 \cdot x) - \ldots (z + (y \cdot 0)) \ldots$
  and simplification rules that looked very similar to rules
  above. You would simplify such numerical expressions by replacing
  for example the $y \cdot 0$ by $0$, or $1\cdot x$ by $x$, and then
  look whether more rules are applicable. If you organise the
  simplification in an inside-out fashion, it is always clear which
  rule should be applied next.\hfill[2 Marks]

\item[(1d)] Implement two functions: The first, called \textit{ders},
  takes a list of characters and a regular expression as arguments, and
  builds the derivative w.r.t.~the list as follows:

\begin{center}
\begin{tabular}{lcl}
$\textit{ders}\;(Nil)\;r$ & $\dn$ & $r$\\
  $\textit{ders}\;(c::cs)\;r$  & $\dn$ &
    $\textit{ders}\;cs\;(\textit{simp}(\textit{der}\;c\;r))$\\
\end{tabular}
\end{center}

Note that this function is different from \textit{der}, which only
takes a single character.

The second function, called \textit{matcher}, takes a string and a
regular expression as arguments. It builds first the derivatives
according to \textit{ders} and after that tests whether the resulting
derivative regular expression can match the empty string (using
\textit{nullable}).  For example the \textit{matcher} will produce
true for the regular expression $(a\cdot b)\cdot c$ and the string
$abc$, but false if you give it the string $ab$. \hfill[1 Mark]

\item[(1e)] Implement a function, called \textit{size}, by recursion
  over regular expressions. If a regular expression is seen as a tree,
  then \textit{size} should return the number of nodes in such a
  tree. Therefore this function is defined as follows:

\begin{center}
\begin{tabular}{lcl}
$\textit{size}(\ZERO)$ & $\dn$ & $1$\\
$\textit{size}(\ONE)$  & $\dn$ & $1$\\
$\textit{size}(c)$     & $\dn$ & $1$\\
$\textit{size}(r_1 + r_2)$ & $\dn$ & $1 + \textit{size}(r_1) + \textit{size}(r_2)$\\
$\textit{size}(r_1 \cdot r_2)$ & $\dn$ & $1 + \textit{size}(r_1) + \textit{size}(r_2)$\\
$\textit{size}(r^*)$ & $\dn$ & $1 + \textit{size}(r)$\\
\end{tabular}
\end{center}

You can use \textit{size} in order to test how much the `evil' regular
expression $(a^*)^* \cdot b$ grows when taking successive derivatives
according the letter $a$ without simplification and then compare it to
taking the derivative, but simplify the result.  The sizes
are given in \texttt{re.scala}. \hfill[1 Mark]
\end{itemize}

\subsection*{Background}

Although easily implementable in Scala, the idea behind the derivative
function might not so easy to be seen. To understand its purpose
better, assume a regular expression $r$ can match strings of the form
$c\!::\!cs$ (that means strings which start with a character $c$ and have
some rest, or tail, $cs$). If you take the derivative of $r$ with
respect to the character $c$, then you obtain a regular expression
that can match all the strings $cs$.  In other words, the regular
expression $\textit{der}\;c\;r$ can match the same strings $c\!::\!cs$
that can be matched by $r$, except that the $c$ is chopped off.

Assume now $r$ can match the string $abc$. If you take the derivative
according to $a$ then you obtain a regular expression that can match
$bc$ (it is $abc$ where the $a$ has been chopped off). If you now
build the derivative $\textit{der}\;b\;(\textit{der}\;a\;r)$ you
obtain a regular expression that can match the string $c$ (it is $bc$
where $b$ is chopped off). If you finally build the derivative of this
according $c$, that is
$\textit{der}\;c\;(\textit{der}\;b\;(\textit{der}\;a\;r))$, you obtain
a regular expression that can match the empty string. You can test
whether this is indeed the case using the function nullable, which is
what your matcher is doing.

The purpose of the $\textit{simp}$ function is to keep the regular
expression small. Normally the derivative function makes the regular
expression bigger (see the SEQ case and the example in (1b)) and the
algorithm would be slower and slower over time. The $\textit{simp}$
function counters this increase in size and the result is that the
algorithm is fast throughout.  By the way, this algorithm is by Janusz
Brzozowski who came up with the idea of derivatives in 1964 in his PhD
thesis.

\begin{center}\small
\url{https://en.wikipedia.org/wiki/Janusz_Brzozowski_(computer_scientist)}
\end{center}


If you want to see how badly the regular expression matchers do in
Java and Python with the `evil' regular expression, then have a look
at the graphs below (you can try it out for yourself: have a look at
the file \texttt{catastrophic.java} on KEATS). Compare this with the
matcher you have implemented. How long can the string of $a$'s be
in your matcher and stay within the 30 seconds time limit?

\begin{center}
\begin{tikzpicture}
\begin{axis}[
    title={Graph: $(a^*)^*\cdot b$ and strings 
           $\underbrace{a\ldots a}_{n}$},
    xlabel={$n$},
    x label style={at={(1.05,0.0)}},
    ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=6cm,
    height=5.0cm, 
    legend entries={Python, Java},  
    legend pos=outer north east]
\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
\end{axis}
\end{tikzpicture}
\end{center}
\newpage

\subsection*{Part 2 (4 Marks)}

Comming from Java or C++, you might think Scala is a quite
esotheric programming language.  But remember, some serious companies
have built their business on Scala. And there are far more esotheric
languages out there. One is called \emph{brainf***}. Urban M\"uller
developed this language in 1993.  A close relative was already
introduced in ... by Corado B\"ohm, an Italian computer pionier, who
unfortunately died a few months ago. One feature of brainf*** is its
minimalistic set of instructions. It has just 8 instructions, all of
which are single characters. Despite this minimalism, this language,
given enough memory, has been shown to be Turing complete. In this
part you will implement an interpreter for this language.



\subsubsection*{Tasks (file bf.scala)}

\begin{itemize}
\item[(2a)] 

\item[(2b)]   

\item[(2c)]

\end{itemize}\bigskip  




\end{document}


%%% Local Variables: 
%%% mode: latex
%%% TeX-master: t
%%% End: