\documentclass{article}\usepackage{../style}\usepackage{disclaimer}\usepackage{../langs}\begin{document}\section*{Coursework 7 (DocDiff and Danube.org)}This coursework is worth 10\%. The first part and second part are dueon 22 November at 11pm; the third, more advanced part, is due on 21December at 11pm. You are asked to implement Scala programs formeasuring similarity in texts and for recommending moviesaccording to a ratings list. Note the second part might includematerial you have not yet seen in the first two lectures. \bigskip\IMPORTANT{}\noindentAlso note that the running time of each part will be restricted to amaximum of 30 seconds on my laptop.\DISCLAIMER{}\subsection*{Reference Implementation}Like the C++ assignments, the Scala assignments will work like this: youpush your files to GitHub and receive (after sometimes a long delay) someautomated feedback. In the end we take a snapshot of the submitted files andapply an automated marking script to them.In addition, the Scala assignments come with a referenceimplementation in form of a \texttt{jar}-file. This allows you to runany test cases on your own computer. For example you can call Scala onthe command line with the option \texttt{-cp docdiff.jar} and thenquery any function from the template file. Say you want to find outwhat the function \texttt{occurences} produces: for this you just needto prefix it with the object name \texttt{CW7a} (and \texttt{CW7b}respectively for \texttt{danube.jar}). If you want to find out whatthese functions produce for the list \texttt{List("a", "b", "b")},you would type something like:\begin{lstlisting}[language={},numbers=none,basicstyle=\ttfamily\small]$ scala -cp docdiff.jarscala> CW7a.occurences(List("a", "b", "b"))...\end{lstlisting}%$\subsection*{Hints}\newpage\subsection*{Part 1 (4 Marks, file docdiff.scala)}It seems source code plagiarism---stealing someone else's code---is aserious problem at other universities.\footnote{Surely, King's students, after all their instructions and warnings, would never commit such an offence.} Dedecting such plagiarism is time-consumingand disheartening. To aid the poor lecturers at other universities,let's implement a program that determines the similarity between twodocuments (be they code or English texts). A document will berepresented as a list of strings.\subsection*{Tasks}\begin{itemize}\item[(1)] Implement a function that cleans a string by finding all words in this string. For this use the regular expression \texttt{"$\backslash$w+"} and the library function \texttt{findAllIn}. The function should return a list of strings.\\ \mbox{}\hfill [1 Mark]\item[(2)] In order to compute the similarity between two documents, we associate each document with a \texttt{Map}. This Map represents the strings in a document and how many times these strings occur in a document. A simple (though slightly inefficient) method for counting the number of string-occurences in a document is as follows: remove all duplicates from the document; for each of these (unique) strings, count how many times they occur in the original document. Return a Map from strings to occurences. For example \begin{center} \pcode{occurences(List("a", "b", "b", "c", "d"))} \end{center} produces \pcode{Map(a -> 1, b -> 2, c -> 1, d -> 1)} and \begin{center} \pcode{occurences(List("d", "b", "d", "b", "d"))} \end{center} produces \pcode{Map(d -> 3, b -> 2)}.\hfill[1 Mark]\item[(3)] You can think of the Maps calculated under (2) as efficient representations of sparse ``vectors''. In this subtask you need to implement the \emph{product} of two vectors, sometimes also called \emph{dot product}.\footnote{\url{https://en.wikipedia.org/wiki/Dot_product}} For this implement a function that takes two documents (\texttt{List[String]}) as arguments. The function first calculates the (unique) strings in both. For each string, it multiplies the occurences in each document. If a string does not occur in one of the documents, then the product is zero. At the end you sum all products. For the two documents in (2) the dot product is 7: \[ \underbrace{1 * 0}_{"a"} \;\;+\;\; \underbrace{2 * 2}_{"b"} \;\;+\;\; \underbrace{1 * 0}_{"c"} \;\;+\;\; \underbrace{1 * 3}_{"d"} \] \hfill\mbox{[1 Mark]}\item[(4)] Implement first a function that calculates the overlap between two documents, say $d_1$ and $d_2$, according to the formula \[ \texttt{overlap}(d_1, d_2) = \frac{d_1 \cdot d_2}{max(d_1^2, d_2^2)} \] This function should return a \texttt{Double} between 0 and 1. The overlap between the lists in (2) is $0.5384615384615384$. Second implement a function that calculates the similarity of two strings, by first extracting the strings using the function from (1) and then calculating the overlap. \hfill\mbox{[1 Mark]}\end{itemize}\newpageYou are creating Danube.org, which you hope will be the next big thingin online movie provider. You know that you can save money byanticipating what movies people will rent; you will pass these savingson to your users by offering a discount if they rent movies that Danube.orgrecommends. This assignment is meant to calculate To do this, you offer an incentive for people to upload their lists ofrecommended books. From their lists, you can establish suggestedpairs. A pair of books is a suggested pair if both books appear on oneperson’s recommendation list. Of course, some suggested pairs are morepopular than others. Also, any given book is paired with some booksmuch more frequently than with others.\end{document}%%% Local Variables: %%% mode: latex%%% TeX-master: t%%% End: