1 % !TEX program = xelatex |
|
2 \documentclass{article} |
|
3 \usepackage{../style} |
|
4 \usepackage{disclaimer} |
|
5 \usepackage{../langs} |
|
6 |
|
7 \begin{document} |
|
8 |
|
9 |
|
10 %% should ask to lower case the words. |
|
11 |
|
12 \section*{Part 7 (Scala)} |
|
13 |
|
14 \mbox{}\hfill\textit{``What one programmer can do in one month,}\\ |
|
15 \mbox{}\hfill\textit{two programmers can do in two months.''}\smallskip\\ |
|
16 \mbox{}\hfill\textit{ --- Frederick P.~Brooks (author of The Mythical Man-Month)}\bigskip\medskip |
|
17 |
|
18 \noindent |
|
19 You are asked to implement Scala programs for measuring similarity in |
|
20 texts, and for recommending movies according to a ratings list. The |
|
21 preliminary part~(4\%) is due on \cwSEVEN{} at 4pm; the core part is due |
|
22 on \cwSEVENa{} at 4pm. Note the core part might include material you |
|
23 have not yet seen in the first two lectures. \bigskip |
|
24 |
|
25 \IMPORTANT{} |
|
26 |
|
27 \noindent |
|
28 Also note that the running time of each part will be restricted to a |
|
29 maximum of 30 seconds on my laptop. |
|
30 |
|
31 \DISCLAIMER{} |
|
32 |
|
33 |
|
34 \subsection*{Reference Implementation} |
|
35 |
|
36 Like the C++ part, the Scala part works like this: you |
|
37 push your files to GitHub and receive (after sometimes a long delay) some |
|
38 automated feedback. In the end we will take a snapshot of the submitted files and |
|
39 apply an automated marking script to them.\medskip |
|
40 |
|
41 \noindent |
|
42 In addition, the Scala part comes with reference |
|
43 implementations in form of \texttt{jar}-files. This allows you to run |
|
44 any test cases on your own computer. For example you can call Scala on |
|
45 the command line with the option \texttt{-cp docdiff.jar} and then |
|
46 query any function from the template file. Say you want to find out |
|
47 what the function \texttt{occurrences} produces: for this you just need |
|
48 to prefix it with the object name \texttt{CW7a} (and \texttt{CW7b} |
|
49 respectively for \texttt{danube.jar}). If you want to find out what |
|
50 these functions produce for the list \texttt{List("a", "b", "b")}, |
|
51 you would type something like: |
|
52 |
|
53 \begin{lstlisting}[language={},numbers=none,basicstyle=\ttfamily\small] |
|
54 $ scala -cp docdiff.jar |
|
55 |
|
56 scala> CW7a.occurrences(List("a", "b", "b")) |
|
57 ... |
|
58 \end{lstlisting}%$ |
|
59 |
|
60 \subsection*{Hints} |
|
61 |
|
62 \noindent |
|
63 \textbf{For Preliminary Part:} useful operations involving regular |
|
64 expressions: |
|
65 \[\texttt{reg.findAllIn(s).toList}\] |
|
66 \noindent finds all |
|
67 substrings in \texttt{s} according to a regular regular expression |
|
68 \texttt{reg}; useful list operations: \texttt{.distinct} |
|
69 removing duplicates from a list, \texttt{.count} counts the number of |
|
70 elements in a list that satisfy some condition, \texttt{.toMap} |
|
71 transfers a list of pairs into a Map, \texttt{.sum} adds up a list of |
|
72 integers, \texttt{.max} calculates the maximum of a list.\bigskip |
|
73 |
|
74 \noindent |
|
75 \textbf{For Core Part:} use \texttt{.split(",").toList} for splitting |
|
76 strings according to commas (similarly $\backslash$\texttt{n}), |
|
77 \texttt{.getOrElse(..,..)} allows to query a Map, but also gives a |
|
78 default value if the Map is not defined, a Map can be `updated' by |
|
79 using \texttt{+}, \texttt{.contains} and \texttt{.filter} can test whether |
|
80 an element is included in a list, and respectively filter out elements in a list, |
|
81 \texttt{.sortBy(\_.\_2)} sorts a list of pairs according to the second |
|
82 elements in the pairs---the sorting is done from smallest to highest, |
|
83 \texttt{.take(n)} for taking some elements in a list (takes fewer if the list |
|
84 contains less than \texttt{n} elements). |
|
85 |
|
86 |
|
87 \newpage |
|
88 \subsection*{Preliminary Part (4 Marks, file docdiff.scala)} |
|
89 |
|
90 It seems source code plagiarism---stealing and submitting someone |
|
91 else's code---is a serious problem at other |
|
92 universities.\footnote{Surely, King's students, after all their |
|
93 instructions and warnings, would never commit such an offence. Yes?} |
|
94 Detecting such plagiarism is time-consuming and disheartening for |
|
95 lecturers at those universities. To aid these poor souls, let's |
|
96 implement in this part a program that determines the similarity |
|
97 between two documents (be they source code or texts in English). A |
|
98 document will be represented as a list of strings. |
|
99 |
|
100 |
|
101 \subsection*{Tasks} |
|
102 |
|
103 \begin{itemize} |
|
104 \item[(1)] Implement a function that `cleans' a string by finding all |
|
105 (proper) words in this string. For this use the regular expression |
|
106 \texttt{\textbackslash{}w+} for recognising words and the library function |
|
107 \texttt{findAllIn}. The function should return a document (a list of |
|
108 strings).\\ |
|
109 \mbox{}\hfill [1 Mark] |
|
110 |
|
111 \item[(2)] In order to compute the overlap between two documents, we |
|
112 associate each document with a \texttt{Map}. This Map represents the |
|
113 strings in a document and how many times these strings occur in the |
|
114 document. A simple (though slightly inefficient) method for counting |
|
115 the number of string-occurrences in a document is as follows: remove |
|
116 all duplicates from the document; for each of these (unique) |
|
117 strings, count how many times they occur in the original document. |
|
118 Return a Map associating strings with occurrences. For example |
|
119 |
|
120 \begin{center} |
|
121 \pcode{occurrences(List("a", "b", "b", "c", "d"))} |
|
122 \end{center} |
|
123 |
|
124 produces \pcode{Map(a -> 1, b -> 2, c -> 1, d -> 1)} and |
|
125 |
|
126 \begin{center} |
|
127 \pcode{occurrences(List("d", "b", "d", "b", "d"))} |
|
128 \end{center} |
|
129 |
|
130 produces \pcode{Map(d -> 3, b -> 2)}.\hfill[1 Mark] |
|
131 |
|
132 \item[(3)] You can think of the Maps calculated under (2) as memory-efficient |
|
133 representations of sparse ``vectors''. In this subtask you need to |
|
134 implement the \emph{product} of two such vectors, sometimes also called |
|
135 \emph{dot product} of two vectors.\footnote{\url{https://en.wikipedia.org/wiki/Dot_product}} |
|
136 |
|
137 For this dot product, implement a function that takes two documents |
|
138 (\texttt{List[String]}) as arguments. The function first calculates |
|
139 the (unique) strings in both. For each string, it multiplies the |
|
140 corresponding occurrences in each document. If a string does not |
|
141 occur in one of the documents, then the product for this string is zero. At the end |
|
142 you need to add up all products. For the two documents in (2) the dot |
|
143 product is 7, because |
|
144 |
|
145 \[ |
|
146 \underbrace{1 * 0}_{"a"} \;\;+\;\; |
|
147 \underbrace{2 * 2}_{"b"} \;\;+\;\; |
|
148 \underbrace{1 * 0}_{"c"} \;\;+\;\; |
|
149 \underbrace{1 * 3}_{"d"} \qquad = 7 |
|
150 \] |
|
151 |
|
152 \hfill\mbox{[1 Mark]} |
|
153 |
|
154 \item[(4)] Implement first a function that calculates the overlap |
|
155 between two documents, say $d_1$ and $d_2$, according to the formula |
|
156 |
|
157 \[ |
|
158 \texttt{overlap}(d_1, d_2) = \frac{d_1 \cdot d_2}{max(d_1^2, d_2^2)} |
|
159 \] |
|
160 |
|
161 where $d_1^2$ means $d_1 \cdot d_1$ and so on. |
|
162 You can expect this function to return a \texttt{Double} between 0 and 1. The |
|
163 overlap between the lists in (2) is $0.5384615384615384$. |
|
164 |
|
165 Second, implement a function that calculates the similarity of |
|
166 two strings, by first extracting the substrings using the clean |
|
167 function from (1) |
|
168 and then calculating the overlap of the resulting documents.\\ |
|
169 \mbox{}\hfill\mbox{[1 Mark]} |
|
170 \end{itemize}\bigskip |
|
171 |
|
172 |
|
173 \subsection*{Core Part (6 Marks, file danube.scala)} |
|
174 |
|
175 You are creating Danube.co.uk which you hope will be the next big thing |
|
176 in online movie renting. You know that you can save money by |
|
177 anticipating what movies people will rent; you will pass these savings |
|
178 on to your users by offering a discount if they rent movies that |
|
179 Danube.co.uk recommends. |
|
180 |
|
181 Your task is to generate \emph{two} movie recommendations for every |
|
182 movie a user rents. To do this, you calculate what other |
|
183 renters, who also watched this movie, suggest by giving positive ratings. |
|
184 Of course, some suggestions are more popular than others. You need to find |
|
185 the two most-frequently suggested movies. Return fewer recommendations, |
|
186 if there are fewer movies suggested. |
|
187 |
|
188 The calculations will be based on the small datasets which the research lab |
|
189 GroupLens provides for education and development purposes. |
|
190 |
|
191 \begin{center} |
|
192 \url{https://grouplens.org/datasets/movielens/} |
|
193 \end{center} |
|
194 |
|
195 \noindent |
|
196 The slightly adapted CSV-files should be downloaded in your Scala |
|
197 file from the URLs: |
|
198 |
|
199 |
|
200 \begin{center} |
|
201 \begin{tabular}{ll} |
|
202 \url{https://nms.kcl.ac.uk/christian.urban/ratings.csv} & (940 KByte)\\ |
|
203 \url{https://nms.kcl.ac.uk/christian.urban/movies.csv} & (280 KByte)\\ |
|
204 \end{tabular} |
|
205 \end{center} |
|
206 |
|
207 \noindent |
|
208 The ratings.csv file is organised as userID, |
|
209 movieID, and rating (which is between 0 and 5, with \emph{positive} ratings |
|
210 being 4 and 5). The file movie.csv is organised as |
|
211 movieID and full movie name. Both files still contain the usual |
|
212 CSV-file header (first line). In this part you are asked |
|
213 to implement functions that process these files. If bandwidth |
|
214 is an issue for you, download the files locally, but in the submitted |
|
215 version use \texttt{Source.fromURL} instead of \texttt{Source.fromFile}. |
|
216 |
|
217 \subsection*{Tasks} |
|
218 |
|
219 \begin{itemize} |
|
220 \item[(1)] Implement the function \pcode{get_csv_url} which takes an |
|
221 URL-string as argument and requests the corresponding file. The two |
|
222 URLs of interest are \pcode{ratings_url} and \pcode{movies_url}, |
|
223 which correspond to CSV-files mentioned above. The function should |
|
224 return the CSV-file appropriately broken up into lines, and the |
|
225 first line should be dropped (that is omit the header of the CSV-file). |
|
226 The result is a list of strings (the lines in the file). In case |
|
227 the url does not produce a file, return the empty list.\\ |
|
228 \mbox{}\hfill [1 Mark] |
|
229 |
|
230 \item[(2)] Implement two functions that process the (broken up) |
|
231 CSV-files from (1). The \pcode{process_ratings} function filters out all |
|
232 ratings below 4 and returns a list of (userID, movieID) pairs. The |
|
233 \pcode{process_movies} function returns a list of (movieID, title) pairs. |
|
234 Note the input to these functions will be the output of the function |
|
235 \pcode{get_csv_url}.\\ |
|
236 \mbox{}\hfill [1 Mark] |
|
237 %\end{itemize} |
|
238 % |
|
239 % |
|
240 %\subsection*{Part 3 (4 Marks, file danube.scala)} |
|
241 % |
|
242 %\subsection*{Tasks} |
|
243 % |
|
244 %\begin{itemize} |
|
245 \item[(3)] Implement a kind of grouping function that calculates a Map |
|
246 containing the userIDs and all the corresponding recommendations for |
|
247 this user (list of movieIDs). This should be implemented in a |
|
248 tail-recursive fashion using a Map as accumulator. This Map is set to |
|
249 \pcode{Map()} at the beginning of the calculation. For example |
|
250 |
|
251 \begin{lstlisting}[numbers=none] |
|
252 val lst = List(("1", "a"), ("1", "b"), |
|
253 ("2", "x"), ("3", "a"), |
|
254 ("2", "y"), ("3", "c")) |
|
255 groupById(lst, Map()) |
|
256 \end{lstlisting} |
|
257 |
|
258 returns the ratings map |
|
259 |
|
260 \begin{center} |
|
261 \pcode{Map(1 -> List(b, a), 2 -> List(y, x), 3 -> List(c, a))}. |
|
262 \end{center} |
|
263 |
|
264 \noindent |
|
265 In which order the elements of the list are given is unimportant.\\ |
|
266 \mbox{}\hfill [1 Mark] |
|
267 |
|
268 \item[(4)] Implement a function that takes a ratings map and a movieID |
|
269 as arguments. The function calculates all suggestions containing the |
|
270 given movie in its recommendations. It returns a list of all these |
|
271 recommendations (each of them is a list and needs to have the given |
|
272 movie deleted, otherwise it might happen we recommend the same movie |
|
273 ``back''). For example for the Map from above and the movie |
|
274 \pcode{"y"} we obtain \pcode{List(List("x"))}, and for the movie |
|
275 \pcode{"a"} we get \pcode{List(List("b"), List("c"))}.\\ |
|
276 \mbox{}\hfill [1 Mark] |
|
277 |
|
278 \item[(5)] Implement a suggestions function which takes a ratings map |
|
279 and a movieID as arguments. It calculates all the recommended movies |
|
280 sorted according to the most frequently suggested movie(s) sorted |
|
281 first. This function returns \emph{all} suggested movieIDs as a list of |
|
282 strings.\\ |
|
283 \mbox{}\hfill [1 Mark] |
|
284 |
|
285 \item[(6)] |
|
286 Implement then a recommendation function which generates a maximum |
|
287 of two most-suggested movies (as calculated above). But it returns |
|
288 the actual movie name, not the movieID. If fewer movies are recommended, |
|
289 then return fewer than two movie names.\\ |
|
290 \mbox{}\hfill [1 Mark] |
|
291 \end{itemize} |
|
292 |
|
293 \end{document} |
|
294 |
|
295 %%% Local Variables: |
|
296 %%% mode: latex |
|
297 %%% TeX-master: t |
|
298 %%% End: |
|