% !TEX program = xelatex\documentclass{article}\usepackage{../styles/style}\usepackage{disclaimer}\usepackage{../styles/langs}\begin{document}%% should ask to lower case the words.\section*{Core Part 2 (Scala, 3 Marks)}\mbox{}\hfill\textit{``What one programmer can do in one month,}\\\mbox{}\hfill\textit{two programmers can do in two months.''}\smallskip\\\mbox{}\hfill\textit{ --- Frederick P.~Brooks (author of The Mythical Man-Month)}\bigskip\medskip\IMPORTANTNONE{}\noindentAlso note that the running time of each part will be restricted to amaximum of 30 seconds on my laptop.\DISCLAIMER{}\subsection*{Reference Implementation}Like the C++ part, the Scala part works like this: youpush your files to GitHub and receive (after sometimes a long delay) someautomated feedback. In the end we will take a snapshot of the submitted files andapply an automated marking script to them.\medskip\noindentIn addition, the Scala part comes with referenceimplementations in form of \texttt{jar}-files. This allows you to runany test cases on your own computer. For example you can call Scala onthe command line with the option \texttt{-cp docdiff.jar} and thenquery any function from the template file. Say you want to find outwhat the function \texttt{occurrences} produces: for this you just needto prefix it with the object name \texttt{C2}. If you want to find out whatthese functions produce for the list \texttt{List("a", "b", "b")},you would type something like:\begin{lstlisting}[language={},numbers=none,basicstyle=\ttfamily\small]$ scala -cp docdiff.jarscala> C2.occurrences(List("a", "b", "b"))...\end{lstlisting}%$\subsection*{Hints}\noindent\textbf{For the Core Part 2:} useful operations involving regularexpressions:\[\texttt{reg.findAllIn(s).toList}\]\noindent finds allsubstrings in \texttt{s} according to a regular regular expression\texttt{reg}; useful list operations: \texttt{.distinct}removing duplicates from a list, \texttt{.count} counts the number ofelements in a list that satisfy some condition, \texttt{.toMap}transfers a list of pairs into a Map, \texttt{.sum} adds up a list ofintegers, \texttt{.max} calculates the maximum of a list.\bigskip\newpage\subsection*{Core Part 2 (3 Marks, file docdiff.scala)}It seems plagiarism---stealing and submitting someoneelse's code---is a serious problem at otheruniversities.\footnote{Surely, King's students, after all their instructions and warnings, would never commit such an offence. Yes?}Detecting such plagiarism is time-consuming and disheartening forlecturers at those universities. To aid these poor souls, let'simplement in this part a program that determines the similaritybetween two documents (be they source code or texts in English). Adocument will be represented as a list of strings.\subsection*{Tasks}\begin{itemize}\item[(1)] Implement a function that `cleans' a string by finding all (proper) words in the string. For this use the regular expression \texttt{\textbackslash{}w+} for recognising words and the library function \texttt{findAllIn}. The function should return a document (a list of strings). \mbox{}\hfill\mbox{[0.5 Marks]}\item[(2)] In order to compute the overlap between two documents, we associate each document with a \texttt{Map}. This Map represents the strings in a document and how many times these strings occur in the document. A simple (though slightly inefficient) method for counting the number of string-occurrences in a document is as follows: remove all duplicates from the document; for each of these (unique) strings, count how many times they occur in the original document. Return a Map associating strings with occurrences. For example \begin{center} \pcode{occurrences(List("a", "b", "b", "c", "d"))} \end{center} produces \pcode{Map(a -> 1, b -> 2, c -> 1, d -> 1)} and \begin{center} \pcode{occurrences(List("d", "b", "d", "b", "d"))} \end{center} produces \pcode{Map(d -> 3, b -> 2)}.\hfill[1 Mark]\item[(3)] You can think of the Maps calculated under (2) as memory-efficient representations of sparse ``vectors''. In this subtask you need to implement the \emph{product} of two such vectors, sometimes also called \emph{dot product} of two vectors.\footnote{\url{https://en.wikipedia.org/wiki/Dot_product}} For this dot product, implement a function that takes two documents (\texttt{List[String]}) as arguments. The function first calculates the (unique) strings in both. For each string, it multiplies the corresponding occurrences in each document. If a string does not occur in one of the documents, then the product for this string is zero. At the end you need to add up all products. For the two documents in (2) the dot product is 7, because \[ \underbrace{1 * 0}_{"a"} \;\;+\;\; \underbrace{2 * 2}_{"b"} \;\;+\;\; \underbrace{1 * 0}_{"c"} \;\;+\;\; \underbrace{1 * 3}_{"d"} \qquad = 7 \] \hfill\mbox{[1 Mark]}\item[(4)] Implement first a function that calculates the overlap between two documents, say $d_1$ and $d_2$, according to the formula \[ \texttt{overlap}(d_1, d_2) = \frac{d_1 \cdot d_2}{max(d_1^2, d_2^2)} \] where $d_1^2$ means $d_1 \cdot d_1$ and so on. You can expect this function to return a \texttt{Double} between 0 and 1. The overlap between the lists in (2) is $0.5384615384615384$. Second, implement a function that calculates the similarity of two strings, by first extracting the substrings using the clean function from (1) and then calculating the overlap of the resulting documents.\\ \mbox{}\hfill\mbox{[0.5 Marks]}\end{itemize}\end{document} %%% Local Variables: %%% mode: latex%%% TeX-master: t%%% End: