--- a/cws/cw02.tex Fri Nov 09 07:30:02 2018 +0000
+++ b/cws/cw02.tex Thu Nov 15 03:35:38 2018 +0000
@@ -1,359 +1,157 @@
\documentclass{article}
-\usepackage{chessboard}
-\usepackage[LSBC4,T1]{fontenc}
-\let\clipbox\relax
\usepackage{../style}
\usepackage{disclaimer}
+\usepackage{../langs}
\begin{document}
-\setchessboard{smallboard,
- zero,
- showmover=false,
- boardfontencoding=LSBC4,
- hlabelformat=\arabic{ranklabel},
- vlabelformat=\arabic{filelabel}}
-\mbox{}\\[-18mm]\mbox{}
+\section*{Coursework 7 (DocDiff and Danube.org)}
-\section*{Coursework 7 (Scala, Knight's Tour)}
-
-This coursework is worth 10\%. It is about searching and
-backtracking. The first part is due on 23 November at 11pm; the
-second, more advanced part, is due on 21 December at 11pm. You are
-asked to implement Scala programs that solve various versions of the
-\textit{Knight's Tour Problem} on a chessboard. Note the second part
-might include material you have not yet seen in the first two
-lectures. \bigskip
+This coursework is worth 10\%. The first part and second part are due
+on 22 November at 11pm; the third, more advanced part, is due on 21
+December at 11pm. You are asked to implement Scala programs for
+measuring similarity in texts and for recommending movies
+according to a ratings list. Note the second part might include
+material you have not yet seen in the first two lectures. \bigskip
\IMPORTANT{}
+
+\noindent
Also note that the running time of each part will be restricted to a
-maximum of 360 seconds on my laptop: If you calculate a result once,
-try to avoid to calculate the result again. Feel free to copy any code
-you need from files \texttt{knight1.scala}, \texttt{knight2.scala} and
-\texttt{knight3.scala}.
+maximum of 30 seconds on my laptop.
\DISCLAIMER{}
-\subsection*{Background}
-The \textit{Knight's Tour Problem} is about finding a tour such that
-the knight visits every field on an $n\times n$ chessboard once. For
-example on a $5\times 5$ chessboard, a knight's tour is:
+\subsection*{Reference Implementation}
-\chessboard[maxfield=d4,
- pgfstyle= {[base,at={\pgfpoint{0pt}{-0.5ex}}]text},
- text = \small 24, markfield=Z4,
- text = \small 11, markfield=a4,
- text = \small 6, markfield=b4,
- text = \small 17, markfield=c4,
- text = \small 0, markfield=d4,
- text = \small 19, markfield=Z3,
- text = \small 16, markfield=a3,
- text = \small 23, markfield=b3,
- text = \small 12, markfield=c3,
- text = \small 7, markfield=d3,
- text = \small 10, markfield=Z2,
- text = \small 5, markfield=a2,
- text = \small 18, markfield=b2,
- text = \small 1, markfield=c2,
- text = \small 22, markfield=d2,
- text = \small 15, markfield=Z1,
- text = \small 20, markfield=a1,
- text = \small 3, markfield=b1,
- text = \small 8, markfield=c1,
- text = \small 13, markfield=d1,
- text = \small 4, markfield=Z0,
- text = \small 9, markfield=a0,
- text = \small 14, markfield=b0,
- text = \small 21, markfield=c0,
- text = \small 2, markfield=d0
- ]
-
-\noindent
-The tour starts in the right-upper corner, then moves to field
-$(3,2)$, then $(4,0)$ and so on. There are no knight's tours on
-$2\times 2$, $3\times 3$ and $4\times 4$ chessboards, but for every
-bigger board there is.
+Like the C++ assignments, the Scala assignments will work like this: you
+push your files to GitHub and receive (after sometimes a long delay) some
+automated feedback. In the end we take a snapshot of the submitted files and
+apply an automated marking script to them.
-A knight's tour is called \emph{closed}, if the last step in the tour
-is within a knight's move to the beginning of the tour. So the above
-knight's tour is \underline{not} closed because the last
-step on field $(0, 4)$ is not within the reach of the first step on
-$(4, 4)$. It turns out there is no closed knight's tour on a $5\times
-5$ board. But there are on a $6\times 6$ board and on bigger ones, for
-example
+In addition, the Scala assignments come with a reference
+implementation in form of a \texttt{jar}-file. This allows you to run
+any test cases on your own computer. For example you can call Scala on
+the command line with the option \texttt{-cp docdiff.jar} and then
+query any function from the template file. Say you want to find out
+what the function \texttt{occurences} produces: for this you just need
+to prefix it with the object name \texttt{CW7a} (and \texttt{CW7b}
+respectively for \texttt{danube.jar}). If you want to find out what
+these functions produce for the list \texttt{List("a", "b", "b")},
+you would type something like:
-\chessboard[maxfield=e5,
- pgfstyle={[base,at={\pgfpoint{0pt}{-0.5ex}}]text},
- text = \small 10, markfield=Z5,
- text = \small 5, markfield=a5,
- text = \small 18, markfield=b5,
- text = \small 25, markfield=c5,
- text = \small 16, markfield=d5,
- text = \small 7, markfield=e5,
- text = \small 31, markfield=Z4,
- text = \small 26, markfield=a4,
- text = \small 9, markfield=b4,
- text = \small 6, markfield=c4,
- text = \small 19, markfield=d4,
- text = \small 24, markfield=e4,
- % 4 11 30 17 8 15
- text = \small 4, markfield=Z3,
- text = \small 11, markfield=a3,
- text = \small 30, markfield=b3,
- text = \small 17, markfield=c3,
- text = \small 8, markfield=d3,
- text = \small 15, markfield=e3,
- %29 32 27 0 23 20
- text = \small 29, markfield=Z2,
- text = \small 32, markfield=a2,
- text = \small 27, markfield=b2,
- text = \small 0, markfield=c2,
- text = \small 23, markfield=d2,
- text = \small 20, markfield=e2,
- %12 3 34 21 14 1
- text = \small 12, markfield=Z1,
- text = \small 3, markfield=a1,
- text = \small 34, markfield=b1,
- text = \small 21, markfield=c1,
- text = \small 14, markfield=d1,
- text = \small 1, markfield=e1,
- %33 28 13 2 35 22
- text = \small 33, markfield=Z0,
- text = \small 28, markfield=a0,
- text = \small 13, markfield=b0,
- text = \small 2, markfield=c0,
- text = \small 35, markfield=d0,
- text = \small 22, markfield=e0,
- vlabel=false,
- hlabel=false
- ]
+\begin{lstlisting}[language={},numbers=none,basicstyle=\ttfamily\small]
+$ scala -cp docdiff.jar
+
+scala> CW7a.occurences(List("a", "b", "b"))
+...
+\end{lstlisting}%$
+
+\subsection*{Hints}
-\noindent
-where the 35th move can join up again with the 0th move.
-
-If you cannot remember how a knight moves in chess, or never played
-chess, below are all potential moves indicated for two knights, one on
-field $(2, 2)$ (blue moves) and another on $(7, 7)$ (red moves):
-\chessboard[maxfield=g7,
- color=blue!50,
- linewidth=0.2em,
- shortenstart=0.5ex,
- shortenend=0.5ex,
- markstyle=cross,
- markfields={a4, c4, Z3, d3, Z1, d1, a0, c0},
- color=red!50,
- markfields={f5, e6},
- setpieces={Ng7, Nb2}]
-
-\subsection*{Part 1 (7 Marks)}
-
-You are asked to implement the knight's tour problem such that the
-dimension of the board can be changed. Therefore most functions will
-take the dimension of the board as an argument. The fun with this
-problem is that even for small chessboard dimensions it has already an
-incredibly large search space---finding a tour is like finding a
-needle in a haystack. In the first task we want to see how far we get
-with exhaustively exploring the complete search space for small
-chessboards.\medskip
+\newpage
+\subsection*{Part 1 (4 Marks, file docdiff.scala)}
-\noindent
-Let us first fix the basic datastructures for the implementation. The
-board dimension is an integer (we will never go beyond board sizes of
-$40 \times 40$). A \emph{position} (or field) on the chessboard is
-a pair of integers, like $(0, 0)$. A \emph{path} is a list of
-positions. The first (or 0th move) in a path is the last element in
-this list; and the last move in the path is the first element. For
-example the path for the $5\times 5$ chessboard above is represented
-by
-
-\[
-\texttt{List($\underbrace{\texttt{(0, 4)}}_{24}$,
- $\underbrace{\texttt{(2, 3)}}_{23}$, ...,
- $\underbrace{\texttt{(3, 2)}}_1$, $\underbrace{\texttt{(4, 4)}}_0$)}
-\]
-
-\noindent
-Suppose the dimension of a chessboard is $n$, then a path is a
-\emph{tour} if the length of the path is $n \times n$, each element
-occurs only once in the path, and each move follows the rules of how a
-knight moves (see above for the rules).
+It seems source code plagiarism---stealing someone else's code---is a
+serious problem at other universities.\footnote{Surely, King's
+ students, after all their instructions and warnings, would never
+ commit such an offence.} Dedecting such plagiarism is time-consuming
+and disheartening. To aid the poor lecturers at other universities,
+let's implement a program that determines the similarity between two
+documents (be they code or English texts). A document will be
+represented as a list of strings.
-\subsubsection*{Tasks (file knight1.scala)}
+\subsection*{Tasks}
\begin{itemize}
-\item[(1a)] Implement an \texttt{is\_legal\_move} function that takes a
- dimension, a path and a position as arguments and tests whether the
- position is inside the board and not yet element in the
- path. \hfill[1 Mark]
+\item[(1)] Implement a function that cleans a string by finding all
+ words in this string. For this use the regular expression
+ \texttt{"$\backslash$w+"} and the library function
+ \texttt{findAllIn}. The function should return a list of
+ strings.\\
+ \mbox{}\hfill [1 Mark]
-\item[(1b)] Implement a \texttt{legal\_moves} function that calculates for a
- position all legal onward moves. If the onward moves are
- placed on a circle, you should produce them starting from
- ``12-o'clock'' following in clockwise order. For example on an
- $8\times 8$ board for a knight at position $(2, 2)$ and otherwise
- empty board, the legal-moves function should produce the onward
- positions in this order:
+\item[(2)] In order to compute the similarity between two documents, we
+ associate each document with a \texttt{Map}. This Map represents the
+ strings in a document and how many times these strings occur in a
+ document. A simple (though slightly inefficient) method for counting
+ the number of string-occurences in a document is as follows: remove
+ all duplicates from the document; for each of these (unique)
+ strings, count how many times they occur in the original document.
+ Return a Map from strings to occurences. For example
\begin{center}
- \texttt{List((3,4), (4,3), (4,1), (3,0), (1,0), (0,1), (0,3), (1,4))}
+ \pcode{occurences(List("a", "b", "b", "c", "d"))}
\end{center}
- If the board is not empty, then maybe some of the moves need to be
- filtered out from this list. For a knight on field $(7, 7)$ and an
- empty board, the legal moves are
+ produces \pcode{Map(a -> 1, b -> 2, c -> 1, d -> 1)} and
\begin{center}
- \texttt{List((6,5), (5,6))}
+ \pcode{occurences(List("d", "b", "d", "b", "d"))}
\end{center}
- \mbox{}\hfill[1 Mark]
-\item[(1c)] Implement two recursive functions (\texttt{count\_tours} and
- \texttt{enum\_tours}). They each take a dimension and a path as
- arguments. They exhaustively search for tours starting
- from the given path. The first function counts all possible
- tours (there can be none for certain board sizes) and the second
- collects all tours in a list of paths.\hfill[2 Marks]
-\end{itemize}
+ produces \pcode{Map(d -> 3, b -> 2)}.\hfill[1 Mark]
-\noindent \textbf{Test data:} For the marking, the functions in (1c)
-will be called with board sizes up to $5 \times 5$. If you search
-for tours on a $5 \times 5$ board starting only from field $(0, 0)$,
-there are 304 of tours. If you try out every field of a $5 \times
-5$-board as a starting field and add up all tours, you obtain
-1728. A $6\times 6$ board is already too large to be searched
-exhaustively.\footnote{For your interest, the number of tours on
- $6\times 6$, $7\times 7$ and $8\times 8$ are 6637920, 165575218320,
- 19591828170979904, respectively.}\bigskip
+\item[(3)] You can think of the Maps calculated under (2) as efficient
+ representations of sparse ``vectors''. In this subtask you need to
+ implement the \emph{product} of two vectors, sometimes also called
+ \emph{dot product}.\footnote{\url{https://en.wikipedia.org/wiki/Dot_product}}
-\noindent
-\textbf{Hints:} useful list functions: \texttt{.contains(..)} checks
-whether an element is in a list, \texttt{.flatten} turns a list of
-lists into just a list, \texttt{\_::\_} puts an element on the head of
-the list, \texttt{.head} gives you the first element of a list (make
-sure the list is not \texttt{Nil}).
-
-\subsubsection*{Tasks (file knight2.scala)}
-
-\begin{itemize}
-\item[(2a)] Implement a \texttt{first}-function. This function takes a list of
- positions and a function $f$ as arguments; $f$ is the name we give to
- this argument). The function $f$ takes a position as argument and
- produces an optional path. So $f$'s type is \texttt{Pos =>
- Option[Path]}. The idea behind the \texttt{first}-function is as follows:
+ For this implement a function that takes two documents
+ (\texttt{List[String]}) as arguments. The function first calculates
+ the (unique) strings in both. For each string, it multiplies the
+ occurences in each document. If a string does not occur in one of the
+ documents, then the product is zero. At the end you
+ sum all products. For the two documents in (2) the dot product is 7:
\[
- \begin{array}{lcl}
- \textit{first}(\texttt{Nil}, f) & \dn & \texttt{None}\\
- \textit{first}(x\!::\!xs, f) & \dn & \begin{cases}
- f(x) & \textit{if}\;f(x) \not=\texttt{None}\\
- \textit{first}(xs, f) & \textit{otherwise}\\
- \end{cases}
- \end{array}
+ \underbrace{1 * 0}_{"a"} \;\;+\;\;
+ \underbrace{2 * 2}_{"b"} \;\;+\;\;
+ \underbrace{1 * 0}_{"c"} \;\;+\;\;
+ \underbrace{1 * 3}_{"d"}
+ \]
+
+ \hfill\mbox{[1 Mark]}
+
+\item[(4)] Implement first a function that calculates the overlap
+ between two documents, say $d_1$ and $d_2$, according to the formula
+
+ \[
+ \texttt{overlap}(d_1, d_2) = \frac{d_1 \cdot d_2}{max(d_1^2, d_2^2)}
\]
- \noindent That is, we want to find the first position where the
- result of $f$ is not \texttt{None}, if there is one. Note that
- `inside' \texttt{first}, you do not (need to) know anything about
- the argument $f$ except its type, namely \texttt{Pos =>
- Option[Path]}. There is one additional point however you should
- take into account when implementing \texttt{first}: you will need to
- calculate what the result of $f(x)$ is; your code should do this
- only \textbf{once} and for as \textbf{few} elements in the list as
- possible! Do not calculate $f(x)$ for all elements and then see which
- is the first \texttt{Some}.\\\mbox{}\hfill[1 Mark]
-
-\item[(2b)] Implement a \texttt{first\_tour} function that uses the
- \texttt{first}-function from (2a), and searches recursively for a tour.
- As there might not be such a tour at all, the \texttt{first\_tour} function
- needs to return a value of type
- \texttt{Option[Path]}.\\\mbox{}\hfill[2 Marks]
+ This function should return a \texttt{Double} between 0 and 1. The
+ overlap between the lists in (2) is $0.5384615384615384$.
+
+ Second implement a function that calculates the similarity of
+ two strings, by first extracting the strings using the function from (1)
+ and then calculating the overlap.
+ \hfill\mbox{[1 Mark]}
\end{itemize}
-\noindent
-\textbf{Testing:} The \texttt{first\_tour} function will be called with board
-sizes of up to $8 \times 8$.
-\bigskip
-
-\noindent
-\textbf{Hints:} a useful list function: \texttt{.filter(..)} filters a
-list according to a boolean function; a useful option function:
-\texttt{.isDefined} returns true, if an option is \texttt{Some(..)};
-anonymous functions can be constructed using \texttt{(x:Int) => ...},
-this functions takes an \texttt{Int} as an argument.
-%%\newpage
-\subsection*{Part 2 (3 Marks)}
-
-As you should have seen in Part 1, a naive search for tours beyond
-$8 \times 8$ boards and also searching for closed tours even on small
-boards takes too much time. There is a heuristic, called \emph{Warnsdorf's
-Rule} that can speed up finding a tour. This heuristic states that a
-knight is moved so that it always proceeds to the field from which the
-knight will have the \underline{fewest} onward moves. For example for
-a knight on field $(1, 3)$, the field $(0, 1)$ has the fewest possible
-onward moves, namely 2.
+\newpage
+You are creating Danube.org, which you hope will be the next big thing
+in online movie provider. You know that you can save money by
+anticipating what movies people will rent; you will pass these savings
+on to your users by offering a discount if they rent movies that Danube.org
+recommends. This assignment is meant to calculate
-\chessboard[maxfield=g7,
- pgfstyle= {[base,at={\pgfpoint{0pt}{-0.5ex}}]text},
- text = \small 3, markfield=Z5,
- text = \small 7, markfield=b5,
- text = \small 7, markfield=c4,
- text = \small 7, markfield=c2,
- text = \small 5, markfield=b1,
- text = \small 2, markfield=Z1,
- setpieces={Na3}]
-
-\noindent
-Warnsdorf's Rule states that the moves on the board above should be
-tried in the order
-
-\[
-(0, 1), (0, 5), (2, 1), (2, 5), (3, 4), (3, 2)
-\]
-\noindent
-Whenever there are ties, the corresponding onward moves can be in any
-order. When calculating the number of onward moves for each field, we
-do not count moves that revisit any field already visited.
-
-\subsubsection*{Tasks (file knight3.scala)}
+To do this, you offer an incentive for people to upload their lists of
+recommended books. From their lists, you can establish suggested
+pairs. A pair of books is a suggested pair if both books appear on one
+person’s recommendation list. Of course, some suggested pairs are more
+popular than others. Also, any given book is paired with some books
+much more frequently than with others.
-\begin{itemize}
-\item[(3a)] Write a function \texttt{ordered\_moves} that calculates a list of
- onward moves like in (1b) but orders them according to the
- Warnsdorf’s Rule. That means moves with the fewest legal onward moves
- should come first (in order to be tried out first). \hfill[1 Mark]
-
-\item[(3b)] Implement a \texttt{first\_closed-tour\_heuristic}
- function that searches for a
- \textbf{closed} tour on a $6\times 6$ board. It should use the
- \texttt{first}-function from (2a) and tries out onward moves according to
- the \texttt{ordered\_moves} function from (3a). It is more likely to find
- a solution when started in the middle of the board (that is
- position $(dimension / 2, dimension / 2)$). \hfill[1 Mark]
-
-\item[(3c)] Implement a \texttt{first\_tour\_heuristic} function
- for boards up to
- $40\times 40$. It is the same function as in (3b) but searches for
- tours (not just closed tours). You have to be careful to write a
- tail-recursive function of the \texttt{first\_tour\_heuristic} function
- otherwise you will get problems with stack-overflows.\\
- \mbox{}\hfill[1 Mark]
-\end{itemize}
-\bigskip
-
-\noindent
-\textbf{Hints:} a useful list function: \texttt{.sortBy} sorts a list
-according to a component given by the function; a function can be
-tested to be tail recursive by annotation \texttt{@tailrec}, which is
-made available by importing \texttt{scala.annotation.tailrec}.