--- a/cws/cw02.tex	Fri Nov 09 07:30:02 2018 +0000
+++ b/cws/cw02.tex	Thu Nov 15 03:35:38 2018 +0000
@@ -1,359 +1,157 @@
 \documentclass{article}
-\usepackage{chessboard}
-\usepackage[LSBC4,T1]{fontenc}
-\let\clipbox\relax
 \usepackage{../style}
 \usepackage{disclaimer}
+\usepackage{../langs}
 
 \begin{document}
 
-\setchessboard{smallboard,
-               zero,
-               showmover=false,
-               boardfontencoding=LSBC4,
-               hlabelformat=\arabic{ranklabel},
-               vlabelformat=\arabic{filelabel}}
 
-\mbox{}\\[-18mm]\mbox{}
+\section*{Coursework 7 (DocDiff and Danube.org)}
 
-\section*{Coursework 7 (Scala, Knight's Tour)}
-
-This coursework is worth 10\%. It is about searching and
-backtracking. The first part is due on 23 November at 11pm; the
-second, more advanced part, is due on 21 December at 11pm. You are
-asked to implement Scala programs that solve various versions of the
-\textit{Knight's Tour Problem} on a chessboard. Note the second part
-might include material you have not yet seen in the first two
-lectures. \bigskip
+This coursework is worth 10\%. The first and second parts are due
+on 22 November at 11pm; the third, more advanced part, is due on 21
+December at 11pm. You are asked to implement Scala programs for
+measuring similarity in texts and for recommending movies
+according to a ratings list.  Note the second part might include
+material you have not yet seen in the first two lectures. \bigskip
 
 \IMPORTANT{}
+
+\noindent
 Also note that the running time of each part will be restricted to a
-maximum of 360 seconds on my laptop: If you calculate a result once,
-try to avoid to calculate the result again. Feel free to copy any code
-you need from files \texttt{knight1.scala}, \texttt{knight2.scala} and
-\texttt{knight3.scala}.
+maximum of 30 seconds on my laptop.
 
 \DISCLAIMER{}
 
-\subsection*{Background}
 
-The \textit{Knight's Tour Problem} is about finding a tour such that
-the knight visits every field on an $n\times n$ chessboard once. For
-example on a $5\times 5$ chessboard, a knight's tour is:
+\subsection*{Reference Implementation}
 
-\chessboard[maxfield=d4, 
-            pgfstyle= {[base,at={\pgfpoint{0pt}{-0.5ex}}]text},
-            text = \small 24, markfield=Z4,
-            text = \small 11, markfield=a4,
-            text = \small  6, markfield=b4,
-            text = \small 17, markfield=c4,
-            text = \small  0, markfield=d4,
-            text = \small 19, markfield=Z3,
-            text = \small 16, markfield=a3,
-            text = \small 23, markfield=b3,
-            text = \small 12, markfield=c3,
-            text = \small  7, markfield=d3,
-            text = \small 10, markfield=Z2,
-            text = \small  5, markfield=a2,
-            text = \small 18, markfield=b2,
-            text = \small  1, markfield=c2,
-            text = \small 22, markfield=d2,
-            text = \small 15, markfield=Z1,
-            text = \small 20, markfield=a1,
-            text = \small  3, markfield=b1,
-            text = \small  8, markfield=c1,
-            text = \small 13, markfield=d1,
-            text = \small  4, markfield=Z0,
-            text = \small  9, markfield=a0,
-            text = \small 14, markfield=b0,
-            text = \small 21, markfield=c0,
-            text = \small  2, markfield=d0
-           ]
-           
-\noindent
-The tour starts in the right-upper corner, then moves to field
-$(3,2)$, then $(4,0)$ and so on. There are no knight's tours on
-$2\times 2$, $3\times 3$ and $4\times 4$ chessboards, but for every
-bigger board there is. 
+Like the C++ assignments, the Scala assignments will work like this: you
+push your files to GitHub and receive (sometimes after a long delay) some
+automated feedback. In the end we take a snapshot of the submitted files and
+apply an automated marking script to them.
 
-A knight's tour is called \emph{closed}, if the last step in the tour
-is within a knight's move to the beginning of the tour. So the above
-knight's tour is \underline{not} closed because the last
-step on field $(0, 4)$ is not within the reach of the first step on
-$(4, 4)$. It turns out there is no closed knight's tour on a $5\times
-5$ board. But there are on a $6\times 6$ board and on bigger ones, for
-example
+In addition, the Scala assignments come with a reference
+implementation in the form of a \texttt{jar}-file. This allows you to run
+any test cases on your own computer. For example, you can call Scala on
+the command line with the option \texttt{-cp docdiff.jar} and then
+query any function from the template file. Say you want to find out
+what the function \texttt{occurences} produces: for this you just need
+to prefix it with the object name \texttt{CW7a} (and \texttt{CW7b}
+respectively for \texttt{danube.jar}).  If you want to find out what
+this function produces for the list \texttt{List("a", "b", "b")},
+you would type something like:
 
-\chessboard[maxfield=e5, 
-            pgfstyle={[base,at={\pgfpoint{0pt}{-0.5ex}}]text},
-            text = \small 10, markfield=Z5,
-            text = \small  5, markfield=a5,
-            text = \small 18, markfield=b5,
-            text = \small 25, markfield=c5,
-            text = \small 16, markfield=d5,
-            text = \small  7, markfield=e5,
-            text = \small 31, markfield=Z4,
-            text = \small 26, markfield=a4,
-            text = \small  9, markfield=b4,
-            text = \small  6, markfield=c4,
-            text = \small 19, markfield=d4,
-            text = \small 24, markfield=e4,
-            % 4  11  30  17   8  15 
-            text = \small  4, markfield=Z3,
-            text = \small 11, markfield=a3,
-            text = \small 30, markfield=b3,
-            text = \small 17, markfield=c3,
-            text = \small  8, markfield=d3,
-            text = \small 15, markfield=e3,
-            %29  32  27   0  23  20 
-            text = \small 29, markfield=Z2,
-            text = \small 32, markfield=a2,
-            text = \small 27, markfield=b2,
-            text = \small  0, markfield=c2,
-            text = \small 23, markfield=d2,
-            text = \small 20, markfield=e2,
-            %12   3  34  21  14   1 
-            text = \small 12, markfield=Z1,
-            text = \small  3, markfield=a1,
-            text = \small 34, markfield=b1,
-            text = \small 21, markfield=c1,
-            text = \small 14, markfield=d1,
-            text = \small  1, markfield=e1,
-            %33  28  13   2  35  22 
-            text = \small 33, markfield=Z0,
-            text = \small 28, markfield=a0,
-            text = \small 13, markfield=b0,
-            text = \small  2, markfield=c0,
-            text = \small 35, markfield=d0,
-            text = \small 22, markfield=e0,
-            vlabel=false,
-            hlabel=false
-           ]
+\begin{lstlisting}[language={},numbers=none,basicstyle=\ttfamily\small]
+$ scala -cp docdiff.jar
+  
+scala> CW7a.occurences(List("a", "b", "b"))
+...
+\end{lstlisting}%$
+
+\subsection*{Hints}
 
 
-\noindent
-where the 35th move can join up again with the 0th move.
-
-If you cannot remember how a knight moves in chess, or never played
-chess, below are all potential moves indicated for two knights, one on
-field $(2, 2)$ (blue moves) and another on $(7, 7)$ (red moves):
 
 
-\chessboard[maxfield=g7,
-            color=blue!50,
-            linewidth=0.2em,
-            shortenstart=0.5ex,
-            shortenend=0.5ex,
-            markstyle=cross,
-            markfields={a4, c4, Z3, d3, Z1, d1, a0, c0},
-            color=red!50,
-            markfields={f5, e6},
-            setpieces={Ng7, Nb2}]
-
-\subsection*{Part 1 (7 Marks)}
-
-You are asked to implement the knight's tour problem such that the
-dimension of the board can be changed.  Therefore most functions will
-take the dimension of the board as an argument.  The fun with this
-problem is that even for small chessboard dimensions it has already an
-incredibly large search space---finding a tour is like finding a
-needle in a haystack. In the first task we want to see how far we get
-with exhaustively exploring the complete search space for small
-chessboards.\medskip
+\newpage
+\subsection*{Part 1 (4 Marks, file docdiff.scala)}
 
-\noindent
-Let us first fix the basic datastructures for the implementation.  The
-board dimension is an integer (we will never go beyond board sizes of
-$40 \times 40$).  A \emph{position} (or field) on the chessboard is
-a pair of integers, like $(0, 0)$. A \emph{path} is a list of
-positions. The first (or 0th move) in a path is the last element in
-this list; and the last move in the path is the first element. For
-example the path for the $5\times 5$ chessboard above is represented
-by
-
-\[
-\texttt{List($\underbrace{\texttt{(0, 4)}}_{24}$,
-  $\underbrace{\texttt{(2, 3)}}_{23}$, ...,
-  $\underbrace{\texttt{(3, 2)}}_1$, $\underbrace{\texttt{(4, 4)}}_0$)}
-\]
-
-\noindent
-Suppose the dimension of a chessboard is $n$, then a path is a
-\emph{tour} if the length of the path is $n \times n$, each element
-occurs only once in the path, and each move follows the rules of how a
-knight moves (see above for the rules).
+It seems source code plagiarism---stealing someone else's code---is a
+serious problem at other universities.\footnote{Surely, King's
+  students, after all their instructions and warnings, would never
+  commit such an offence.} Detecting such plagiarism is time-consuming
+and disheartening. To aid the poor lecturers at other universities,
+let's implement a program that determines the similarity between two
+documents (be they code or English texts). A document will be
+represented as a list of strings.
 
 
-\subsubsection*{Tasks (file knight1.scala)}
+\subsection*{Tasks}
 
 \begin{itemize}
-\item[(1a)] Implement an \texttt{is\_legal\_move} function that takes a
-  dimension, a path and a position as arguments and tests whether the
-  position is inside the board and not yet element in the
-  path. \hfill[1 Mark]
+\item[(1)] Implement a function that cleans a string by finding all
+  words in this string. For this use the regular expression
+  \texttt{"$\backslash$w+"} and the library function
+  \texttt{findAllIn}. The function should return a list of
+  strings.\\
+  \mbox{}\hfill [1 Mark]
 
-\item[(1b)] Implement a \texttt{legal\_moves} function that calculates for a
-  position all legal onward moves. If the onward moves are
-  placed on a circle, you should produce them starting from
-  ``12-o'clock'' following in clockwise order.  For example on an
-  $8\times 8$ board for a knight at position $(2, 2)$ and otherwise
-  empty board, the legal-moves function should produce the onward
-  positions in this order:
+\item[(2)] In order to compute the similarity between two documents, we
+  associate each document with a \texttt{Map}. This Map records the
+  strings in the document and how many times each of them occurs in
+  it. A simple (though slightly inefficient) method for counting
+  the number of string occurrences in a document is as follows: remove
+  all duplicates from the document; for each of these (unique)
+  strings, count how many times it occurs in the original document.
+  Return a Map from strings to occurrences. For example
 
   \begin{center}
-  \texttt{List((3,4), (4,3), (4,1), (3,0), (1,0), (0,1), (0,3), (1,4))}
+  \pcode{occurences(List("a", "b", "b", "c", "d"))}
   \end{center}
 
-  If the board is not empty, then maybe some of the moves need to be
-  filtered out from this list.  For a knight on field $(7, 7)$ and an
-  empty board, the legal moves are
+  produces \pcode{Map(a -> 1, b -> 2, c -> 1, d -> 1)} and
 
   \begin{center}
-  \texttt{List((6,5), (5,6))}
+  \pcode{occurences(List("d", "b", "d", "b", "d"))}
   \end{center}
-  \mbox{}\hfill[1 Mark]
 
-\item[(1c)] Implement two recursive functions (\texttt{count\_tours} and
-  \texttt{enum\_tours}). They each take a dimension and a path as
-  arguments. They exhaustively search for tours starting
-  from the given path. The first function counts all possible 
-  tours (there can be none for certain board sizes) and the second
-  collects all tours in a list of paths.\hfill[2 Marks]
-\end{itemize}
+  produces \pcode{Map(d -> 3, b -> 2)}.\hfill[1 Mark]
 
-\noindent \textbf{Test data:} For the marking, the functions in (1c)
-will be called with board sizes up to $5 \times 5$. If you search
-for tours on a $5 \times 5$ board starting only from field $(0, 0)$,
-there are 304 of tours. If you try out every field of a $5 \times
-5$-board as a starting field and add up all tours, you obtain
-1728. A $6\times 6$ board is already too large to be searched
-exhaustively.\footnote{For your interest, the number of tours on
-  $6\times 6$, $7\times 7$ and $8\times 8$ are 6637920, 165575218320,
-  19591828170979904, respectively.}\bigskip
+\item[(3)] You can think of the Maps calculated under (2) as efficient
+  representations of sparse ``vectors''. In this subtask you need to
+  implement the \emph{product} of two vectors, sometimes also called
+  \emph{dot product}.\footnote{\url{https://en.wikipedia.org/wiki/Dot_product}}
 
-\noindent
-\textbf{Hints:} useful list functions: \texttt{.contains(..)} checks
-whether an element is in a list, \texttt{.flatten} turns a list of
-lists into just a list, \texttt{\_::\_} puts an element on the head of
-the list, \texttt{.head} gives you the first element of a list (make
-sure the list is not \texttt{Nil}).
-
-\subsubsection*{Tasks (file knight2.scala)}
-
-\begin{itemize}
-\item[(2a)] Implement a \texttt{first}-function. This function takes a list of
-  positions and a function $f$ as arguments; $f$ is the name we give to
-  this argument). The function $f$ takes a position as argument and
-  produces an optional path. So $f$'s type is \texttt{Pos =>
-    Option[Path]}. The idea behind the \texttt{first}-function is as follows:
+  For this, implement a function that takes two documents
+  (\texttt{List[String]}) as arguments. The function first calculates
+  the (unique) strings in both. For each string, it multiplies the
+  numbers of occurrences in the two documents. If a string does not occur
+  in one of the documents, then the product is zero. At the end you
+  sum all products. For the two documents in (2) the dot product is 7:
 
   \[
-  \begin{array}{lcl}
-  \textit{first}(\texttt{Nil}, f) & \dn & \texttt{None}\\  
-  \textit{first}(x\!::\!xs, f) & \dn & \begin{cases}
-    f(x) & \textit{if}\;f(x) \not=\texttt{None}\\
-    \textit{first}(xs, f) & \textit{otherwise}\\
-                              \end{cases}
-  \end{array}
+    \underbrace{1 * 0}_{\texttt{"a"}} \;\;+\;\;
+    \underbrace{2 * 2}_{\texttt{"b"}} \;\;+\;\;
+    \underbrace{1 * 0}_{\texttt{"c"}} \;\;+\;\;
+    \underbrace{1 * 3}_{\texttt{"d"}}
+  \]  
+  
+  \hfill\mbox{[1 Mark]}
+
+\item[(4)] First implement a function that calculates the overlap
+  between two documents, say $d_1$ and $d_2$, according to the formula
+
+  \[
+  \texttt{overlap}(d_1, d_2) = \frac{d_1 \cdot d_2}{\max(d_1^2, d_2^2)}
   \]
 
-  \noindent That is, we want to find the first position where the
-  result of $f$ is not \texttt{None}, if there is one. Note that
-  `inside' \texttt{first}, you do not (need to) know anything about
-  the argument $f$ except its type, namely \texttt{Pos =>
-    Option[Path]}. There is one additional point however you should
-  take into account when implementing \texttt{first}: you will need to
-  calculate what the result of $f(x)$ is; your code should do this
-  only \textbf{once} and for as \textbf{few} elements in the list as
-  possible! Do not calculate $f(x)$ for all elements and then see which 
-  is the first \texttt{Some}.\\\mbox{}\hfill[1 Mark]
-  
-\item[(2b)] Implement a \texttt{first\_tour} function that uses the
-  \texttt{first}-function from (2a), and searches recursively for a tour.
-  As there might not be such a tour at all, the \texttt{first\_tour} function
-  needs to return a value of type
-  \texttt{Option[Path]}.\\\mbox{}\hfill[2 Marks]
+  Here $d^2$ stands for the dot product $d \cdot d$. This function should return a \texttt{Double}
+  between 0 and 1. The overlap between the lists in (2) is $7/13 = 0.5384615384615384$.
+
+  Second, implement a function that calculates the similarity of
+  two strings by first extracting the words of each string using the function
+  from (1) and then calculating the overlap of the two resulting lists.
+  \hfill\mbox{[1 Mark]}
 \end{itemize}
 
-\noindent
-\textbf{Testing:} The \texttt{first\_tour} function will be called with board
-sizes of up to $8 \times 8$.
-\bigskip
-
-\noindent
-\textbf{Hints:} a useful list function: \texttt{.filter(..)} filters a
-list according to a boolean function; a useful option function:
-\texttt{.isDefined} returns true, if an option is \texttt{Some(..)};
-anonymous functions can be constructed using \texttt{(x:Int) => ...},
-this functions takes an \texttt{Int} as an argument.
 
 
-%%\newpage
-\subsection*{Part 2 (3 Marks)}
-
-As you should have seen in Part 1, a naive search for tours beyond
-$8 \times 8$ boards and also searching for closed tours even on small
-boards takes too much time. There is a heuristic, called \emph{Warnsdorf's
-Rule} that can speed up finding a tour. This heuristic states that a
-knight is moved so that it always proceeds to the field from which the
-knight will have the \underline{fewest} onward moves.  For example for
-a knight on field $(1, 3)$, the field $(0, 1)$ has the fewest possible
-onward moves, namely 2.
+\newpage
+You are creating Danube.org, which you hope will be the next big thing
+among online movie providers. You know that you can save money by
+anticipating what movies people will rent; you will pass these savings
+on to your users by offering a discount if they rent movies that Danube.org
+recommends.  This assignment is meant to calculate 
 
-\chessboard[maxfield=g7,
-            pgfstyle= {[base,at={\pgfpoint{0pt}{-0.5ex}}]text},
-            text = \small 3, markfield=Z5,
-            text = \small 7, markfield=b5,
-            text = \small 7, markfield=c4,
-            text = \small 7, markfield=c2,
-            text = \small 5, markfield=b1,
-            text = \small 2, markfield=Z1,
-            setpieces={Na3}]
-
-\noindent
-Warnsdorf's Rule states that the moves on the board above should be
-tried in the order
-
-\[
-(0, 1), (0, 5), (2, 1), (2, 5), (3, 4), (3, 2)
-\]
 
-\noindent
-Whenever there are ties, the corresponding onward moves can be in any
-order.  When calculating the number of onward moves for each field, we
-do not count moves that revisit any field already visited.
-
-\subsubsection*{Tasks (file knight3.scala)}
+To do this, you offer an incentive for people to upload their lists of
+recommended movies. From their lists, you can establish suggested
+pairs. A pair of movies is a suggested pair if both movies appear on one
+person's recommendation list. Of course, some suggested pairs are more
+popular than others. Also, any given movie is paired with some movies
+much more frequently than with others.
 
-\begin{itemize}
-\item[(3a)] Write a function \texttt{ordered\_moves} that calculates a list of
-  onward moves like in (1b) but orders them according to the
-  Warnsdorf’s Rule. That means moves with the fewest legal onward moves
-  should come first (in order to be tried out first). \hfill[1 Mark]
-  
-\item[(3b)] Implement a \texttt{first\_closed-tour\_heuristic}
-  function that searches for a
-  \textbf{closed} tour on a $6\times 6$ board. It should use the
-  \texttt{first}-function from (2a) and tries out onward moves according to
-  the \texttt{ordered\_moves} function from (3a). It is more likely to find
-  a solution when started in the middle of the board (that is
-  position $(dimension / 2, dimension / 2)$). \hfill[1 Mark]
-
-\item[(3c)] Implement a \texttt{first\_tour\_heuristic} function
-  for boards up to
-  $40\times 40$.  It is the same function as in (3b) but searches for
-  tours (not just closed tours). You have to be careful to write a
-  tail-recursive function of the \texttt{first\_tour\_heuristic} function
-  otherwise you will get problems with stack-overflows.\\
-  \mbox{}\hfill[1 Mark]
-\end{itemize}  
-\bigskip
-
-\noindent
-\textbf{Hints:} a useful list function: \texttt{.sortBy} sorts a list
-according to a component given by the function; a function can be
-tested to be tail recursive by annotation \texttt{@tailrec}, which is
-made available by importing \texttt{scala.annotation.tailrec}.
changeset 202	f7bcb27d1940
parent 166	780c40aaad27
child 203	eb188f9ac038