1 % !TEX program = xelatex  | 
         | 
     2 \documentclass{article} | 
         | 
     3 \usepackage{../style} | 
         | 
     4 \usepackage{disclaimer} | 
         | 
     5 \usepackage{../langs} | 
         | 
     6   | 
         | 
     7 \begin{document} | 
         | 
     8   | 
         | 
     9   | 
         | 
    10 %% should ask to lower case the words.  | 
         | 
    11   | 
         | 
    12 \section*{Part 7 (Scala)} | 
         | 
    13   | 
         | 
    14 \mbox{}\hfill\textit{``What one programmer can do in one month,}\\ | 
         | 
    15 \mbox{}\hfill\textit{two programmers can do in two months.''}\smallskip\\ | 
         | 
    16 \mbox{}\hfill\textit{ --- Frederick P.~Brooks (author of The Mythical Man-Month)}\bigskip\medskip | 
         | 
    17   | 
         | 
    18 \noindent  | 
         | 
    19 You are asked to implement Scala programs for measuring similarity in  | 
         | 
    20 texts, and for recommending movies according to a ratings list. The  | 
         | 
    21 preliminary part~(4\%) is due on \cwSEVEN{} at 4pm; the core part is due | 
         | 
    22 on \cwSEVENa{} at 4pm.   Note the core part might include material you | 
         | 
    23 have not yet seen in the first two lectures. \bigskip  | 
         | 
    24   | 
         | 
    25 \IMPORTANT{} | 
         | 
    26   | 
         | 
    27 \noindent  | 
         | 
    28 Also note that the running time of each part will be restricted to a  | 
         | 
    29 maximum of 30 seconds on my laptop.  | 
         | 
    30   | 
         | 
    31 \DISCLAIMER{} | 
         | 
    32   | 
         | 
    33   | 
         | 
    34 \subsection*{Reference Implementation} | 
         | 
    35   | 
         | 
    36 Like the C++ part, the Scala part works like this: you  | 
         | 
    37 push your files to GitHub and receive (after sometimes a long delay) some  | 
         | 
    38 automated feedback. In the end we will take a snapshot of the submitted files and  | 
         | 
    39 apply an automated marking script to them.\medskip  | 
         | 
    40   | 
         | 
    41 \noindent  | 
         | 
    42 In addition, the Scala part comes with reference  | 
         | 
    43 implementations in form of \texttt{jar}-files. This allows you to run | 
         | 
    44 any test cases on your own computer. For example you can call Scala on  | 
         | 
    45 the command line with the option \texttt{-cp docdiff.jar} and then | 
         | 
    46 query any function from the template file. Say you want to find out  | 
         | 
    47 what the function \texttt{occurrences} produces: for this you just need | 
         | 
    48 to prefix it with the object name \texttt{CW7a} (and \texttt{CW7b} | 
         | 
    49 respectively for \texttt{danube.jar}).  If you want to find out what | 
         | 
    50 these functions produce for the list \texttt{List("a", "b", "b")}, | 
         | 
    51 you would type something like:  | 
         | 
    52   | 
         | 
    53 \begin{lstlisting}[language={},numbers=none,basicstyle=\ttfamily\small] | 
         | 
    54 $ scala -cp docdiff.jar  | 
         | 
    55     | 
         | 
    56 scala> CW7a.occurrences(List("a", "b", "b")) | 
         | 
    57 ...  | 
         | 
    58 \end{lstlisting}%$ | 
         | 
    59   | 
         | 
    60 \subsection*{Hints} | 
         | 
    61   | 
         | 
    62 \noindent  | 
         | 
    63 \textbf{For Preliminary Part:} useful operations involving regular | 
         | 
    64 expressions:  | 
         | 
    65 \[\texttt{reg.findAllIn(s).toList}\] | 
         | 
    66 \noindent finds all  | 
         | 
    67 substrings in \texttt{s} according to a regular regular expression | 
         | 
    68 \texttt{reg}; useful list operations: \texttt{.distinct} | 
         | 
    69 removing duplicates from a list, \texttt{.count} counts the number of | 
         | 
    70 elements in a list that satisfy some condition, \texttt{.toMap} | 
         | 
    71 transfers a list of pairs into a Map, \texttt{.sum} adds up a list of | 
         | 
    72 integers, \texttt{.max} calculates the maximum of a list.\bigskip | 
         | 
    73   | 
         | 
    74 \noindent  | 
         | 
    75 \textbf{For Core Part:} use \texttt{.split(",").toList} for splitting | 
         | 
    76 strings according to commas (similarly $\backslash$\texttt{n}), | 
         | 
    77 \texttt{.getOrElse(..,..)} allows to query a Map, but also gives a | 
         | 
    78 default value if the Map is not defined, a Map can be `updated' by  | 
         | 
    79 using \texttt{+}, \texttt{.contains} and \texttt{.filter} can test whether | 
         | 
    80 an element is included in a list, and respectively filter out elements in a list,  | 
         | 
    81 \texttt{.sortBy(\_.\_2)} sorts a list of pairs according to the second | 
         | 
    82 elements in the pairs---the sorting is done from smallest to highest,  | 
         | 
    83 \texttt{.take(n)} for taking some elements in a list (takes fewer if the list | 
         | 
    84 contains less than \texttt{n} elements). | 
         | 
    85   | 
         | 
    86   | 
         | 
    87 \newpage  | 
         | 
    88 \subsection*{Preliminary Part (4 Marks, file docdiff.scala)} | 
         | 
    89   | 
         | 
    90 It seems source code plagiarism---stealing and submitting someone  | 
         | 
    91 else's code---is a serious problem at other  | 
         | 
    92 universities.\footnote{Surely, King's students, after all their | 
         | 
    93   instructions and warnings, would never commit such an offence. Yes?}  | 
         | 
    94 Detecting such plagiarism is time-consuming and disheartening for  | 
         | 
    95 lecturers at those universities. To aid these poor souls, let's  | 
         | 
    96 implement in this part a program that determines the similarity  | 
         | 
    97 between two documents (be they source code or texts in English). A  | 
         | 
    98 document will be represented as a list of strings.  | 
         | 
    99   | 
         | 
   100   | 
         | 
   101 \subsection*{Tasks} | 
         | 
   102   | 
         | 
   103 \begin{itemize} | 
         | 
   104 \item[(1)] Implement a function that `cleans' a string by finding all  | 
         | 
   105   (proper) words in this string. For this use the regular expression  | 
         | 
   106   \texttt{\textbackslash{}w+} for recognising words and the library function | 
         | 
   107   \texttt{findAllIn}. The function should return a document (a list of | 
         | 
   108   strings).\\  | 
         | 
   109   \mbox{}\hfill [1 Mark] | 
         | 
   110   | 
         | 
   111 \item[(2)] In order to compute the overlap between two documents, we  | 
         | 
   112   associate each document with a \texttt{Map}. This Map represents the | 
         | 
   113   strings in a document and how many times these strings occur in the  | 
         | 
   114   document. A simple (though slightly inefficient) method for counting  | 
         | 
   115   the number of string-occurrences in a document is as follows: remove  | 
         | 
   116   all duplicates from the document; for each of these (unique)  | 
         | 
   117   strings, count how many times they occur in the original document.  | 
         | 
   118   Return a Map associating strings with occurrences. For example  | 
         | 
   119   | 
         | 
   120   \begin{center} | 
         | 
   121   \pcode{occurrences(List("a", "b", "b", "c", "d"))} | 
         | 
   122   \end{center} | 
         | 
   123   | 
         | 
   124   produces \pcode{Map(a -> 1, b -> 2, c -> 1, d -> 1)} and | 
         | 
   125   | 
         | 
   126   \begin{center} | 
         | 
   127   \pcode{occurrences(List("d", "b", "d", "b", "d"))} | 
         | 
   128   \end{center} | 
         | 
   129   | 
         | 
   130   produces \pcode{Map(d -> 3, b -> 2)}.\hfill[1 Mark] | 
         | 
   131   | 
         | 
   132 \item[(3)] You can think of the Maps calculated under (2) as memory-efficient  | 
         | 
   133   representations of sparse ``vectors''. In this subtask you need to  | 
         | 
   134   implement the \emph{product} of two such vectors, sometimes also called | 
         | 
   135   \emph{dot product} of two vectors.\footnote{\url{https://en.wikipedia.org/wiki/Dot_product}} | 
         | 
   136   | 
         | 
   137   For this dot product, implement a function that takes two documents  | 
         | 
   138   (\texttt{List[String]}) as arguments. The function first calculates | 
         | 
   139   the (unique) strings in both. For each string, it multiplies the  | 
         | 
   140   corresponding occurrences in each document. If a string does not  | 
         | 
   141   occur in one of the documents, then the product for this string is zero. At the end  | 
         | 
   142   you need to add up all products. For the two documents in (2) the dot  | 
         | 
   143   product is 7, because  | 
         | 
   144   | 
         | 
   145   \[  | 
         | 
   146     \underbrace{1 * 0}_{"a"} \;\;+\;\; | 
         | 
   147     \underbrace{2 * 2}_{"b"} \;\;+\;\; | 
         | 
   148     \underbrace{1 * 0}_{"c"} \;\;+\;\; | 
         | 
   149     \underbrace{1 * 3}_{"d"} \qquad = 7 | 
         | 
   150   \]    | 
         | 
   151     | 
         | 
   152   \hfill\mbox{[1 Mark]} | 
         | 
   153   | 
         | 
   154 \item[(4)] Implement first a function that calculates the overlap  | 
         | 
   155   between two documents, say $d_1$ and $d_2$, according to the formula  | 
         | 
   156   | 
         | 
   157   \[  | 
         | 
   158   \texttt{overlap}(d_1, d_2) = \frac{d_1 \cdot d_2}{max(d_1^2, d_2^2)}   | 
         | 
   159   \]  | 
         | 
   160   | 
         | 
   161   where $d_1^2$ means $d_1 \cdot d_1$ and so on.  | 
         | 
   162   You can expect this function to return a \texttt{Double} between 0 and 1. The | 
         | 
   163   overlap between the lists in (2) is $0.5384615384615384$.  | 
         | 
   164   | 
         | 
   165   Second, implement a function that calculates the similarity of  | 
         | 
   166   two strings, by first extracting the substrings using the clean  | 
         | 
   167   function from (1)  | 
         | 
   168   and then calculating the overlap of the resulting documents.\\  | 
         | 
   169   \mbox{}\hfill\mbox{[1 Mark]} | 
         | 
   170 \end{itemize}\bigskip | 
         | 
   171   | 
         | 
   172   | 
         | 
   173 \subsection*{Core Part (6 Marks, file danube.scala)} | 
         | 
   174   | 
         | 
   175 You are creating Danube.co.uk which you hope will be the next big thing  | 
         | 
   176 in online movie renting. You know that you can save money by  | 
         | 
   177 anticipating what movies people will rent; you will pass these savings  | 
         | 
   178 on to your users by offering a discount if they rent movies that  | 
         | 
   179 Danube.co.uk recommends.    | 
         | 
   180   | 
         | 
   181 Your task is to generate \emph{two} movie recommendations for every | 
         | 
   182 movie a user rents. To do this, you calculate what other  | 
         | 
   183 renters, who also watched this movie, suggest by giving positive ratings.  | 
         | 
   184 Of course, some suggestions are more popular than others. You need to find  | 
         | 
   185 the two most-frequently suggested movies. Return fewer recommendations,  | 
         | 
   186 if there are fewer movies suggested.  | 
         | 
   187   | 
         | 
   188 The calculations will be based on the small datasets which the research lab  | 
         | 
   189 GroupLens provides for education and development purposes.  | 
         | 
   190   | 
         | 
   191 \begin{center} | 
         | 
   192 \url{https://grouplens.org/datasets/movielens/} | 
         | 
   193 \end{center} | 
         | 
   194   | 
         | 
   195 \noindent  | 
         | 
   196 The slightly adapted CSV-files should be downloaded in your Scala  | 
         | 
   197 file from the URLs:  | 
         | 
   198   | 
         | 
   199   | 
         | 
   200 \begin{center} | 
         | 
   201 \begin{tabular}{ll}   | 
         | 
   202   \url{https://nms.kcl.ac.uk/christian.urban/ratings.csv} & (940 KByte)\\ | 
         | 
   203   \url{https://nms.kcl.ac.uk/christian.urban/movies.csv}  & (280 KByte)\\ | 
         | 
   204 \end{tabular} | 
         | 
   205 \end{center} | 
         | 
   206   | 
         | 
   207 \noindent  | 
         | 
   208 The ratings.csv file is organised as userID,   | 
         | 
   209 movieID, and rating (which is between 0 and 5, with \emph{positive} ratings | 
         | 
   210 being 4 and 5). The file movie.csv is organised as  | 
         | 
   211 movieID and full movie name. Both files still contain the usual  | 
         | 
   212 CSV-file header (first line). In this part you are asked  | 
         | 
   213 to implement functions that process these files. If bandwidth  | 
         | 
   214 is an issue for you, download the files locally, but in the submitted  | 
         | 
   215 version use \texttt{Source.fromURL} instead of \texttt{Source.fromFile}. | 
         | 
   216   | 
         | 
   217 \subsection*{Tasks} | 
         | 
   218   | 
         | 
   219 \begin{itemize} | 
         | 
   220 \item[(1)] Implement the function \pcode{get_csv_url} which takes an | 
         | 
   221   URL-string as argument and requests the corresponding file. The two  | 
         | 
   222   URLs of interest are \pcode{ratings_url} and \pcode{movies_url}, | 
         | 
   223   which correspond to CSV-files mentioned above.  The function should  | 
         | 
   224   return the CSV-file appropriately broken up into lines, and the  | 
         | 
   225   first line should be dropped (that is omit the header of the CSV-file).  | 
         | 
   226   The result is a list of strings (the lines in the file). In case  | 
         | 
   227   the url does not produce a file, return the empty list.\\  | 
         | 
   228   \mbox{}\hfill [1 Mark] | 
         | 
   229   | 
         | 
   230 \item[(2)] Implement two functions that process the (broken up)  | 
         | 
   231   CSV-files from (1). The \pcode{process_ratings} function filters out all | 
         | 
   232   ratings below 4 and returns a list of (userID, movieID) pairs. The  | 
         | 
   233   \pcode{process_movies} function returns a list of (movieID, title) pairs. | 
         | 
   234   Note the input to these functions will be the output of the function  | 
         | 
   235   \pcode{get_csv_url}.\\ | 
         | 
   236   \mbox{}\hfill [1 Mark] | 
         | 
   237 %\end{itemize}   | 
         | 
   238 %    | 
         | 
   239 %  | 
         | 
   240 %\subsection*{Part 3 (4 Marks, file danube.scala)} | 
         | 
   241 %  | 
         | 
   242 %\subsection*{Tasks} | 
         | 
   243 %  | 
         | 
   244 %\begin{itemize} | 
         | 
   245 \item[(3)] Implement a kind of grouping function that calculates a Map  | 
         | 
   246   containing the userIDs and all the corresponding recommendations for  | 
         | 
   247   this user (list of movieIDs). This should be implemented in a  | 
         | 
   248   tail-recursive fashion using a Map as accumulator. This Map is set to  | 
         | 
   249   \pcode{Map()} at the beginning of the calculation. For example | 
         | 
   250   | 
         | 
   251 \begin{lstlisting}[numbers=none] | 
         | 
   252 val lst = List(("1", "a"), ("1", "b"), | 
         | 
   253                ("2", "x"), ("3", "a"), | 
         | 
   254                ("2", "y"), ("3", "c")) | 
         | 
   255 groupById(lst, Map())  | 
         | 
   256 \end{lstlisting} | 
         | 
   257   | 
         | 
   258 returns the ratings map  | 
         | 
   259   | 
         | 
   260 \begin{center} | 
         | 
   261   \pcode{Map(1 -> List(b, a), 2 -> List(y, x), 3 -> List(c, a))}. | 
         | 
   262 \end{center} | 
         | 
   263   | 
         | 
   264 \noindent  | 
         | 
   265 In which order the elements of the list are given is unimportant.\\  | 
         | 
   266 \mbox{}\hfill [1 Mark] | 
         | 
   267   | 
         | 
   268 \item[(4)] Implement a function that takes a ratings map and a movieID  | 
         | 
   269   as arguments.  The function calculates all suggestions containing the  | 
         | 
   270   given movie in its recommendations. It returns a list of all these  | 
         | 
   271   recommendations (each of them is a list and needs to have the given  | 
         | 
   272   movie deleted, otherwise it might happen we recommend the same movie  | 
         | 
   273   ``back''). For example for the Map from above and the movie  | 
         | 
   274   \pcode{"y"} we obtain \pcode{List(List("x"))}, and for the movie | 
         | 
   275   \pcode{"a"} we get \pcode{List(List("b"), List("c"))}.\\ | 
         | 
   276   \mbox{}\hfill [1 Mark] | 
         | 
   277   | 
         | 
   278 \item[(5)] Implement a suggestions function which takes a ratings map  | 
         | 
   279   and a movieID as arguments. It calculates all the recommended movies  | 
         | 
   280   sorted according to the most frequently suggested movie(s) sorted  | 
         | 
   281   first. This function returns \emph{all} suggested movieIDs as a list of | 
         | 
   282   strings.\\  | 
         | 
   283   \mbox{}\hfill [1 Mark] | 
         | 
   284   | 
         | 
   285 \item[(6)]    | 
         | 
   286   Implement then a recommendation function which generates a maximum  | 
         | 
   287   of two most-suggested movies (as calculated above). But it returns  | 
         | 
   288   the actual movie name, not the movieID. If fewer movies are recommended,  | 
         | 
   289   then return fewer than two movie names.\\  | 
         | 
   290   \mbox{}\hfill [1 Mark] | 
         | 
   291 \end{itemize} | 
         | 
   292   | 
         | 
   293 \end{document}  | 
         | 
   294   | 
         | 
   295 %%% Local Variables:   | 
         | 
   296 %%% mode: latex  | 
         | 
   297 %%% TeX-master: t  | 
         | 
   298 %%% End:   | 
         |