\documentclass{article}
\usepackage{../style}
\usepackage{disclaimer}
\usepackage{../langs}
\begin{document}
\section*{Coursework 7 (DocDiff and Danube.org)}
This coursework is worth 10\%. The first part and second part are due
on 22 November at 11pm; the third, more advanced part, is due on 21
December at 11pm. You are asked to implement Scala programs for
measuring similarity in texts and for recommending movies
according to a ratings list. Note the second part might include
material you have not yet seen in the first two lectures. \bigskip
\IMPORTANT{}
\noindent
Also note that the running time of each part will be restricted to a
maximum of 30 seconds on my laptop.
\DISCLAIMER{}
\subsection*{Reference Implementation}
Like the C++ assignments, the Scala assignments will work like this: you
push your files to GitHub and receive (after sometimes a long delay) some
automated feedback. In the end we take a snapshot of the submitted files and
apply an automated marking script to them.
In addition, the Scala assignments come with a reference
implementation in form of a \texttt{jar}-file. This allows you to run
any test cases on your own computer. For example you can call Scala on
the command line with the option \texttt{-cp docdiff.jar} and then
query any function from the template file. Say you want to find out
what the function \texttt{occurences} produces: for this you just need
to prefix it with the object name \texttt{CW7a} (and \texttt{CW7b}
respectively for \texttt{danube.jar}). If you want to find out what
these functions produce for the list \texttt{List("a", "b", "b")},
you would type something like:
\begin{lstlisting}[language={},numbers=none,basicstyle=\ttfamily\small]
$ scala -cp docdiff.jar
scala> CW7a.occurences(List("a", "b", "b"))
...
\end{lstlisting}%$
\subsection*{Hints}
\newpage
\subsection*{Part 1 (4 Marks, file docdiff.scala)}
It seems source code plagiarism---stealing someone else's code---is a
serious problem at other universities.\footnote{Surely, King's
students, after all their instructions and warnings, would never
commit such an offence.} Dedecting such plagiarism is time-consuming
and disheartening. To aid the poor lecturers at other universities,
let's implement a program that determines the similarity between two
documents (be they code or English texts). A document will be
represented as a list of strings.
\subsection*{Tasks}
\begin{itemize}
\item[(1)] Implement a function that cleans a string by finding all
words in this string. For this use the regular expression
\texttt{"$\backslash$w+"} and the library function
\texttt{findAllIn}. The function should return a list of
strings.\\
\mbox{}\hfill [1 Mark]
\item[(2)] In order to compute the similarity between two documents, we
associate each document with a \texttt{Map}. This Map represents the
strings in a document and how many times these strings occur in a
document. A simple (though slightly inefficient) method for counting
the number of string-occurences in a document is as follows: remove
all duplicates from the document; for each of these (unique)
strings, count how many times they occur in the original document.
Return a Map from strings to occurences. For example
\begin{center}
\pcode{occurences(List("a", "b", "b", "c", "d"))}
\end{center}
produces \pcode{Map(a -> 1, b -> 2, c -> 1, d -> 1)} and
\begin{center}
\pcode{occurences(List("d", "b", "d", "b", "d"))}
\end{center}
produces \pcode{Map(d -> 3, b -> 2)}.\hfill[1 Mark]
\item[(3)] You can think of the Maps calculated under (2) as efficient
representations of sparse ``vectors''. In this subtask you need to
implement the \emph{product} of two vectors, sometimes also called
\emph{dot product}.\footnote{\url{https://en.wikipedia.org/wiki/Dot_product}}
For this implement a function that takes two documents
(\texttt{List[String]}) as arguments. The function first calculates
the (unique) strings in both. For each string, it multiplies the
occurences in each document. If a string does not occur in one of the
documents, then the product is zero. At the end you
sum all products. For the two documents in (2) the dot product is 7:
\[
\underbrace{1 * 0}_{"a"} \;\;+\;\;
\underbrace{2 * 2}_{"b"} \;\;+\;\;
\underbrace{1 * 0}_{"c"} \;\;+\;\;
\underbrace{1 * 3}_{"d"}
\]
\hfill\mbox{[1 Mark]}
\item[(4)] Implement first a function that calculates the overlap
between two documents, say $d_1$ and $d_2$, according to the formula
\[
\texttt{overlap}(d_1, d_2) = \frac{d_1 \cdot d_2}{max(d_1^2, d_2^2)}
\]
This function should return a \texttt{Double} between 0 and 1. The
overlap between the lists in (2) is $0.5384615384615384$.
Second implement a function that calculates the similarity of
two strings, by first extracting the strings using the function from (1)
and then calculating the overlap.
\hfill\mbox{[1 Mark]}
\end{itemize}
\newpage
You are creating Danube.org, which you hope will be the next big thing
in online movie provider. You know that you can save money by
anticipating what movies people will rent; you will pass these savings
on to your users by offering a discount if they rent movies that Danube.org
recommends. This assignment is meant to calculate
To do this, you offer an incentive for people to upload their lists of
recommended books. From their lists, you can establish suggested
pairs. A pair of books is a suggested pair if both books appear on one
person’s recommendation list. Of course, some suggested pairs are more
popular than others. Also, any given book is paired with some books
much more frequently than with others.
\end{document}
%%% Local Variables:
%%% mode: latex
%%% TeX-master: t
%%% End: