34 In addition, the Scala assignments come with a reference |
34 In addition, the Scala assignments come with a reference |
35 implementation in form of a \texttt{jar}-file. This allows you to run |
35 implementation in form of a \texttt{jar}-file. This allows you to run |
36 any test cases on your own computer. For example you can call Scala on |
36 any test cases on your own computer. For example you can call Scala on |
37 the command line with the option \texttt{-cp docdiff.jar} and then |
37 the command line with the option \texttt{-cp docdiff.jar} and then |
38 query any function from the template file. Say you want to find out |
38 query any function from the template file. Say you want to find out |
39 what the function \texttt{occurences} produces: for this you just need |
39 what the function \texttt{occurrences} produces: for this you just need |
40 to prefix it with the object name \texttt{CW7a} (and \texttt{CW7b} |
40 to prefix it with the object name \texttt{CW7a} (and \texttt{CW7b} |
41 respectively for \texttt{danube.jar}). If you want to find out what |
41 respectively for \texttt{danube.jar}). If you want to find out what |
42 these functions produce for the list \texttt{List("a", "b", "b")}, |
42 these functions produce for the list \texttt{List("a", "b", "b")}, |
43 you would type something like: |
43 you would type something like: |
44 |
44 |
45 \begin{lstlisting}[language={},numbers=none,basicstyle=\ttfamily\small] |
45 \begin{lstlisting}[language={},numbers=none,basicstyle=\ttfamily\small] |
46 $ scala -cp docdiff.jar |
46 $ scala -cp docdiff.jar |
47 |
47 |
48 scala> CW7a.occurences(List("a", "b", "b")) |
48 scala> CW7a.occurrences(List("a", "b", "b")) |
49 ... |
49 ... |
50 \end{lstlisting}%$ |
50 \end{lstlisting}%$ |
51 |
51 |
52 \subsection*{Hints} |
52 \subsection*{Hints} |
53 |
53 |
54 |
54 \noindent |
|
55 \textbf{For Part 1:} useful operations involving regular |
|
56 expressions: |
|
57 \[\texttt{reg.findAllIn(s).toList}\] |
|
58 \noindent finds all |
|
59 substrings in \texttt{s} according to a regular regular expression |
|
60 \texttt{reg}; useful list operations: \texttt{.distinct} |
|
61 removing duplicates from a list, \texttt{.count} counts the number of |
|
62 elements in a list that satisfy some condition, \texttt{.toMap} |
|
63 transfers a list of pairs into a Map, \texttt{.sum} adds up a list of |
|
64 integers, \texttt{.max} calculates the maximum of a list.\bigskip |
|
65 |
|
66 \noindent |
|
67 \textbf{For Part 2 + 3:} use \texttt{.split(",").toList} for splitting |
|
68 strings according to commas (similarly $\backslash$\texttt{n}), |
|
69 \texttt{.getOrElse(..,..)} allows to querry a Map, but also gives a |
|
70 default value if the Map is not defined, a Map can be `updated' by |
|
71 using \texttt{+}, \texttt{.contains} and \texttt{.filter} can test whether |
|
72 an element is included in a list, and respectively filter out elements in a list, |
|
73 \texttt{.sortBy(\_.\_2)} sorts a list of pairs according to the second |
|
74 elements in the pairs---the sorting is done from smallest to highest, |
|
75 \texttt{.take(n)} for taking some elements in a list (takes fewer if the list |
|
76 contains less than \texttt{n} elements). |
55 |
77 |
56 |
78 |
57 \newpage |
79 \newpage |
58 \subsection*{Part 1 (4 Marks, file docdiff.scala)} |
80 \subsection*{Part 1 (4 Marks, file docdiff.scala)} |
59 |
81 |
60 It seems source code plagiarism---stealing someone else's code---is a |
82 It seems source code plagiarism---stealing and submitting someone |
61 serious problem at other universities.\footnote{Surely, King's |
83 else's code---is a serious problem at other |
62 students, after all their instructions and warnings, would never |
84 universities.\footnote{Surely, King's students, after all their |
63 commit such an offence.} Dedecting such plagiarism is time-consuming |
85 instructions and warnings, would never commit such an offence. Yes?} |
64 and disheartening. To aid the poor lecturers at other universities, |
86 Detecting such plagiarism is time-consuming and disheartening for |
65 let's implement a program that determines the similarity between two |
87 lecturers at those universities. To aid these poor souls, let's |
66 documents (be they code or English texts). A document will be |
88 implement in this part a program that determines the similarity |
67 represented as a list of strings. |
89 between two documents (be they source code or texts in English). A |
|
90 document will be represented as a list of strings. |
68 |
91 |
69 |
92 |
70 \subsection*{Tasks} |
93 \subsection*{Tasks} |
71 |
94 |
72 \begin{itemize} |
95 \begin{itemize} |
73 \item[(1)] Implement a function that cleans a string by finding all |
96 \item[(1)] Implement a function that `cleans' a string by finding all |
74 words in this string. For this use the regular expression |
97 (proper) words in this string. For this use the regular expression |
75 \texttt{"$\backslash$w+"} and the library function |
98 \texttt{$\backslash$w+} for recognising word characters and the |
76 \texttt{findAllIn}. The function should return a list of |
99 library function \texttt{findAllIn}. The function should return a |
77 strings.\\ |
100 document (a list of |
78 \mbox{}\hfill [1 Mark] |
101 strings).\\ |
79 |
102 \mbox{}\hfill [1 Mark] |
80 \item[(2)] In order to compute the similarity between two documents, we |
103 |
|
104 \item[(2)] In order to compute the overlap between two documents, we |
81 associate each document with a \texttt{Map}. This Map represents the |
105 associate each document with a \texttt{Map}. This Map represents the |
82 strings in a document and how many times these strings occur in a |
106 strings in a document and how many times these strings occur in the |
83 document. A simple (though slightly inefficient) method for counting |
107 document. A simple (though slightly inefficient) method for counting |
84 the number of string-occurences in a document is as follows: remove |
108 the number of string-occurrences in a document is as follows: remove |
85 all duplicates from the document; for each of these (unique) |
109 all duplicates from the document; for each of these (unique) |
86 strings, count how many times they occur in the original document. |
110 strings, count how many times they occur in the original document. |
87 Return a Map from strings to occurences. For example |
111 Return a Map associating strings with occurrences. For example |
88 |
112 |
89 \begin{center} |
113 \begin{center} |
90 \pcode{occurences(List("a", "b", "b", "c", "d"))} |
114 \pcode{occurrences(List("a", "b", "b", "c", "d"))} |
91 \end{center} |
115 \end{center} |
92 |
116 |
93 produces \pcode{Map(a -> 1, b -> 2, c -> 1, d -> 1)} and |
117 produces \pcode{Map(a -> 1, b -> 2, c -> 1, d -> 1)} and |
94 |
118 |
95 \begin{center} |
119 \begin{center} |
96 \pcode{occurences(List("d", "b", "d", "b", "d"))} |
120 \pcode{occurrences(List("d", "b", "d", "b", "d"))} |
97 \end{center} |
121 \end{center} |
98 |
122 |
99 produces \pcode{Map(d -> 3, b -> 2)}.\hfill[1 Mark] |
123 produces \pcode{Map(d -> 3, b -> 2)}.\hfill[1 Mark] |
100 |
124 |
101 \item[(3)] You can think of the Maps calculated under (2) as efficient |
125 \item[(3)] You can think of the Maps calculated under (2) as memory-efficient |
102 representations of sparse ``vectors''. In this subtask you need to |
126 representations of sparse ``vectors''. In this subtask you need to |
103 implement the \emph{product} of two vectors, sometimes also called |
127 implement the \emph{product} of two such vectors, sometimes also called |
104 \emph{dot product}.\footnote{\url{https://en.wikipedia.org/wiki/Dot_product}} |
128 \emph{dot product} of two vectors.\footnote{\url{https://en.wikipedia.org/wiki/Dot_product}} |
105 |
129 |
106 For this implement a function that takes two documents |
130 For this dot product, implement a function that takes two documents |
107 (\texttt{List[String]}) as arguments. The function first calculates |
131 (\texttt{List[String]}) as arguments. The function first calculates |
108 the (unique) strings in both. For each string, it multiplies the |
132 the (unique) strings in both. For each string, it multiplies the |
109 occurences in each document. If a string does not occur in one of the |
133 corresponding occurrences in each document. If a string does not |
110 documents, then the product is zero. At the end you |
134 occur in one of the documents, then the product for this string is zero. At the end |
111 sum all products. For the two documents in (2) the dot product is 7: |
135 you need to add up all products. For the two documents in (2) the dot |
|
136 product is 7, because |
112 |
137 |
113 \[ |
138 \[ |
114 \underbrace{1 * 0}_{"a"} \;\;+\;\; |
139 \underbrace{1 * 0}_{"a"} \;\;+\;\; |
115 \underbrace{2 * 2}_{"b"} \;\;+\;\; |
140 \underbrace{2 * 2}_{"b"} \;\;+\;\; |
116 \underbrace{1 * 0}_{"c"} \;\;+\;\; |
141 \underbrace{1 * 0}_{"c"} \;\;+\;\; |
117 \underbrace{1 * 3}_{"d"} |
142 \underbrace{1 * 3}_{"d"} \qquad = 7 |
118 \] |
143 \] |
119 |
144 |
120 \hfill\mbox{[1 Mark]} |
145 \hfill\mbox{[1 Mark]} |
121 |
146 |
122 \item[(4)] Implement first a function that calculates the overlap |
147 \item[(4)] Implement first a function that calculates the overlap |
124 |
149 |
125 \[ |
150 \[ |
126 \texttt{overlap}(d_1, d_2) = \frac{d_1 \cdot d_2}{max(d_1^2, d_2^2)} |
151 \texttt{overlap}(d_1, d_2) = \frac{d_1 \cdot d_2}{max(d_1^2, d_2^2)} |
127 \] |
152 \] |
128 |
153 |
129 This function should return a \texttt{Double} between 0 and 1. The |
154 You can expect this function to return a \texttt{Double} between 0 and 1. The |
130 overlap between the lists in (2) is $0.5384615384615384$. |
155 overlap between the lists in (2) is $0.5384615384615384$. |
131 |
156 |
132 Second implement a function that calculates the similarity of |
157 Second, implement a function that calculates the similarity of |
133 two strings, by first extracting the strings using the function from (1) |
158 two strings, by first extracting the substrings using the clean |
134 and then calculating the overlap. |
159 function from (1) |
135 \hfill\mbox{[1 Mark]} |
160 and then calculating the overlap of the resulting documents.\\ |
|
161 \mbox{}\hfill\mbox{[1 Mark]} |
|
162 \end{itemize}\bigskip |
|
163 |
|
164 |
|
165 \subsection*{Part 2 (2 Marks, file danube.scala)} |
|
166 |
|
167 You are creating Danube.co.uk which you hope will be the next big thing |
|
168 in online movie renting. You know that you can save money by |
|
169 anticipating what movies people will rent; you will pass these savings |
|
170 on to your users by offering a discount if they rent movies that |
|
171 Danube.co.uk recommends. |
|
172 |
|
173 Your task is to generate \emph{two} movie recommendations for every |
|
174 movie a user rents. To do this, you calculate what other |
|
175 renters, who also watched this movie, suggest by giving positive ratings. |
|
176 Of course, some suggestions are more popular than others. You need to find |
|
177 the two most-frequently suggested movies. Return fewer recommendations, |
|
178 if there are fewer movies suggested. |
|
179 |
|
180 The calculations will be based on the small datasets which the research lab |
|
181 GroupLens provides for education and development purposes. |
|
182 |
|
183 \begin{center} |
|
184 \url{https://grouplens.org/datasets/movielens/} |
|
185 \end{center} |
|
186 |
|
187 \noindent |
|
188 The slightly adapted CSV-files should be downloaded in your Scala |
|
189 file from the URLs: |
|
190 |
|
191 |
|
192 \begin{center} |
|
193 \begin{tabular}{ll} |
|
194 \url{https://nms.kcl.ac.uk/christian.urban/ratings.csv} & (940 KByte)\\ |
|
195 \url{https://nms.kcl.ac.uk/christian.urban/movies.csv} & (280 KByte)\\ |
|
196 \end{tabular} |
|
197 \end{center} |
|
198 |
|
199 \noindent |
|
200 The ratings.csv file is organised as userID, |
|
201 movieID, and rating (which is between 0 and 5, with \emph{positive} ratings |
|
202 being 4 and 5). The file movie.csv is organised as |
|
203 movieID and full movie name. Both files still contain the usual |
|
204 CSV-file header (first line). In this part you are asked |
|
205 to implement functions that process these files. If bandwidth |
|
206 is an issue for you, download the files locally, but in the submitted |
|
207 version use \texttt{Source.fromURL} instead of \texttt{Source.fromFile}. |
|
208 |
|
209 \subsection*{Tasks} |
|
210 |
|
211 \begin{itemize} |
|
212 \item[(1)] Implement the function \pcode{get_csv_url} which takes an |
|
213 URL-string as argument and requests the corresponding file. The two |
|
214 URLs of interest are \pcode{ratings_url} and \pcode{movies_url}, |
|
215 which correspond to CSV-files mentioned above. The function should |
|
216 return the CSV-file appropriately broken up into lines, and the |
|
217 first line should be dropped (that is omit the header of the CSV-file). |
|
218 The result is a list of strings (the lines in the file). In case |
|
219 the url does not produce a file, return the empty list.\\ |
|
220 \mbox{}\hfill [1 Mark] |
|
221 |
|
222 \item[(2)] Implement two functions that process the (broken up) |
|
223 CSV-files from (1). The \pcode{process_ratings} function filters out all |
|
224 ratings below 4 and returns a list of (userID, movieID) pairs. The |
|
225 \pcode{process_movies} function returns a list of (movieID, title) pairs.\\ |
|
226 \mbox{}\hfill [1 Mark] |
|
227 \end{itemize} |
|
228 |
|
229 |
|
230 \subsection*{Part 3 (4 Marks, file danube.scala)} |
|
231 |
|
232 \subsection*{Tasks} |
|
233 |
|
234 \begin{itemize} |
|
235 \item[(3)] Implement a kind of grouping function that calculates a Map |
|
236 containing the userIDs and all the corresponding recommendations for |
|
237 this user (list of movieIDs). This should be implemented in a tail |
|
238 recursive fashion using a Map as accumulator. This Map is set to |
|
239 \pcode{Map()} at the beginning of the calculation. For example |
|
240 |
|
241 \begin{lstlisting}[numbers=none] |
|
242 val lst = List(("1", "a"), ("1", "b"), |
|
243 ("2", "x"), ("3", "a"), |
|
244 ("2", "y"), ("3", "c")) |
|
245 groupById(lst, Map()) |
|
246 \end{lstlisting} |
|
247 |
|
248 returns the ratings map |
|
249 |
|
250 \begin{center} |
|
251 \pcode{Map(1 -> List(b, a), 2 -> List(y, x), 3 -> List(c, a))}. |
|
252 \end{center} |
|
253 |
|
254 \noindent |
|
255 In which order the elements of the list are given is unimportant.\\ |
|
256 \mbox{}\hfill [1 Mark] |
|
257 |
|
258 \item[(4)] Implement a function that takes a ratings map and a movieID |
|
259 as argument. The function calculates all suggestions containing the |
|
260 agiven movie in its recommendations. It returns a list of all these |
|
261 recommendations (each of them is a list and needs to have the given |
|
262 movie deleted, otherwise it might happen we recommend the same movie |
|
263 ``back''). For example for the Map from above and the movie |
|
264 \pcode{"y"} we obtain \pcode{List(List("x"))}, and for the movie |
|
265 \pcode{"a"} we get \pcode{List(List("b"), List("c"))}.\\ |
|
266 \mbox{}\hfill [1 Mark] |
|
267 |
|
268 \item[(5)] Implement a suggestions function which takes a ratings map |
|
269 and a movieID as arguments. It calculates all the recommended movies |
|
270 sorted according to the most frequently suggested movie(s) sorted |
|
271 first. This function returns \emph{all} suggested movieIDs as a list of |
|
272 strings.\\ |
|
273 \mbox{}\hfill [1 Mark] |
|
274 |
|
275 \item[(6)] |
|
276 Implement then a recommendation function which generates a maximum |
|
277 of two most-suggested movies (as calculated above). But it returns |
|
278 the actual movie name, not the movieID. If fewer movies are recommended, |
|
279 then return fewer than two movie names.\\ |
|
280 \mbox{}\hfill [1 Mark] |
136 \end{itemize} |
281 \end{itemize} |
137 |
|
138 |
|
139 |
|
140 \newpage |
|
141 You are creating Danube.org, which you hope will be the next big thing |
|
142 in online movie provider. You know that you can save money by |
|
143 anticipating what movies people will rent; you will pass these savings |
|
144 on to your users by offering a discount if they rent movies that Danube.org |
|
145 recommends. This assignment is meant to calculate |
|
146 |
|
147 |
|
148 To do this, you offer an incentive for people to upload their lists of |
|
149 recommended books. From their lists, you can establish suggested |
|
150 pairs. A pair of books is a suggested pair if both books appear on one |
|
151 person’s recommendation list. Of course, some suggested pairs are more |
|
152 popular than others. Also, any given book is paired with some books |
|
153 much more frequently than with others. |
|
154 |
|
155 |
|
156 |
|
157 |
282 |
158 \end{document} |
283 \end{document} |
159 |
284 |
160 %%% Local Variables: |
285 %%% Local Variables: |
161 %%% mode: latex |
286 %%% mode: latex |