cws/cw02.tex
changeset 203 eb188f9ac038
parent 202 f7bcb27d1940
child 210 63a1376cbebd
equal deleted inserted replaced
202:f7bcb27d1940 203:eb188f9ac038
     4 \usepackage{../langs}
     4 \usepackage{../langs}
     5 
     5 
     6 \begin{document}
     6 \begin{document}
     7 
     7 
     8 
     8 
     9 \section*{Coursework 7 (DocDiff and Danube.org)}
     9 \section*{Coursework 7 (Scala)}
    10 
    10 
    11 This coursework is worth 10\%. The first part and second part are due
    11 This coursework is worth 10\%. The first and second part are due
    12 on 22 November at 11pm; the third, more advanced part, is due on 21
    12 on 22 November at 11pm; the third, more advanced part, is due on 21
    13 December at 11pm. You are asked to implement Scala programs for
    13 December at 11pm. You are asked to implement Scala programs for
    14 measuring similarity in texts and for recommending movies
    14 measuring similarity in texts, and for recommending movies
    15 according to a ratings list.  Note the second part might include
    15 according to a ratings list.  Note the second part might include
    16 material you have not yet seen in the first two lectures. \bigskip
    16 material you have not yet seen in the first two lectures. \bigskip
    17 
    17 
    18 \IMPORTANT{}
    18 \IMPORTANT{}
    19 
    19 
    34 In addition, the Scala assignments come with a reference
    34 In addition, the Scala assignments come with a reference
    35 implementation in form of a \texttt{jar}-file. This allows you to run
    35 implementation in form of a \texttt{jar}-file. This allows you to run
    36 any test cases on your own computer. For example you can call Scala on
    36 any test cases on your own computer. For example you can call Scala on
    37 the command line with the option \texttt{-cp docdiff.jar} and then
    37 the command line with the option \texttt{-cp docdiff.jar} and then
    38 query any function from the template file. Say you want to find out
    38 query any function from the template file. Say you want to find out
    39 what the function \texttt{occurences} produces: for this you just need
    39 what the function \texttt{occurrences} produces: for this you just need
    40 to prefix it with the object name \texttt{CW7a} (and \texttt{CW7b}
    40 to prefix it with the object name \texttt{CW7a} (and \texttt{CW7b}
    41 respectively for \texttt{danube.jar}).  If you want to find out what
    41 respectively for \texttt{danube.jar}).  If you want to find out what
    42 these functions produce for the list \texttt{List("a", "b", "b")},
    42 these functions produce for the list \texttt{List("a", "b", "b")},
    43 you would type something like:
    43 you would type something like:
    44 
    44 
    45 \begin{lstlisting}[language={},numbers=none,basicstyle=\ttfamily\small]
    45 \begin{lstlisting}[language={},numbers=none,basicstyle=\ttfamily\small]
    46 $ scala -cp docdiff.jar
    46 $ scala -cp docdiff.jar
    47   
    47   
    48 scala> CW7a.occurences(List("a", "b", "b"))
    48 scala> CW7a.occurrences(List("a", "b", "b"))
    49 ...
    49 ...
    50 \end{lstlisting}%$
    50 \end{lstlisting}%$
    51 
    51 
    52 \subsection*{Hints}
    52 \subsection*{Hints}
    53 
    53 
    54 
    54 \noindent
       
    55 \textbf{For Part 1:} useful operations involving regular
       
    56 expressions:
       
    57 \[\texttt{reg.findAllIn(s).toList}\]
       
    58 \noindent finds all
       
    59 substrings in \texttt{s} according to a regular regular expression
       
    60 \texttt{reg}; useful list operations: \texttt{.distinct}
       
    61 removing duplicates from a list, \texttt{.count} counts the number of
       
    62 elements in a list that satisfy some condition, \texttt{.toMap}
       
    63 transfers a list of pairs into a Map, \texttt{.sum} adds up a list of
       
    64 integers, \texttt{.max} calculates the maximum of a list.\bigskip
       
    65 
       
    66 \noindent
       
    67 \textbf{For Part 2 + 3:} use \texttt{.split(",").toList} for splitting
       
    68 strings according to commas (similarly $\backslash$\texttt{n}),
       
    69 \texttt{.getOrElse(..,..)} allows to querry a Map, but also gives a
       
    70 default value if the Map is not defined, a Map can be `updated' by
       
    71 using \texttt{+}, \texttt{.contains} and \texttt{.filter} can test whether
       
    72 an element is included in a list, and respectively filter out elements in a list,
       
    73 \texttt{.sortBy(\_.\_2)} sorts a list of pairs according to the second
       
    74 elements in the pairs---the sorting is done from smallest to highest,
       
    75 \texttt{.take(n)} for taking some elements in a list (takes fewer if the list
       
    76 contains less than \texttt{n} elements).
    55 
    77 
    56 
    78 
    57 \newpage
    79 \newpage
    58 \subsection*{Part 1 (4 Marks, file docdiff.scala)}
    80 \subsection*{Part 1 (4 Marks, file docdiff.scala)}
    59 
    81 
    60 It seems source code plagiarism---stealing someone else's code---is a
    82 It seems source code plagiarism---stealing and submitting someone
    61 serious problem at other universities.\footnote{Surely, King's
    83 else's code---is a serious problem at other
    62   students, after all their instructions and warnings, would never
    84 universities.\footnote{Surely, King's students, after all their
    63   commit such an offence.} Dedecting such plagiarism is time-consuming
    85   instructions and warnings, would never commit such an offence. Yes?}
    64 and disheartening. To aid the poor lecturers at other universities,
    86 Detecting such plagiarism is time-consuming and disheartening for
    65 let's implement a program that determines the similarity between two
    87 lecturers at those universities. To aid these poor souls, let's
    66 documents (be they code or English texts). A document will be
    88 implement in this part a program that determines the similarity
    67 represented as a list of strings.
    89 between two documents (be they source code or texts in English). A
       
    90 document will be represented as a list of strings.
    68 
    91 
    69 
    92 
    70 \subsection*{Tasks}
    93 \subsection*{Tasks}
    71 
    94 
    72 \begin{itemize}
    95 \begin{itemize}
    73 \item[(1)] Implement a function that cleans a string by finding all
    96 \item[(1)] Implement a function that `cleans' a string by finding all
    74   words in this string. For this use the regular expression
    97   (proper) words in this string. For this use the regular expression
    75   \texttt{"$\backslash$w+"} and the library function
    98   \texttt{$\backslash$w+} for recognising word characters and the
    76   \texttt{findAllIn}. The function should return a list of
    99   library function \texttt{findAllIn}. The function should return a
    77   strings.\\
   100   document (a list of
    78   \mbox{}\hfill [1 Mark]
   101   strings).\\
    79 
   102   \mbox{}\hfill [1 Mark]
    80 \item[(2)] In order to compute the similarity between two documents, we
   103 
       
   104 \item[(2)] In order to compute the overlap between two documents, we
    81   associate each document with a \texttt{Map}. This Map represents the
   105   associate each document with a \texttt{Map}. This Map represents the
    82   strings in a document and how many times these strings occur in a
   106   strings in a document and how many times these strings occur in the
    83   document. A simple (though slightly inefficient) method for counting
   107   document. A simple (though slightly inefficient) method for counting
    84   the number of string-occurences in a document is as follows: remove
   108   the number of string-occurrences in a document is as follows: remove
    85   all duplicates from the document; for each of these (unique)
   109   all duplicates from the document; for each of these (unique)
    86   strings, count how many times they occur in the original document.
   110   strings, count how many times they occur in the original document.
    87   Return a Map from strings to occurences. For example
   111   Return a Map associating strings with occurrences. For example
    88 
   112 
    89   \begin{center}
   113   \begin{center}
    90   \pcode{occurences(List("a", "b", "b", "c", "d"))}
   114   \pcode{occurrences(List("a", "b", "b", "c", "d"))}
    91   \end{center}
   115   \end{center}
    92 
   116 
    93   produces \pcode{Map(a -> 1, b -> 2, c -> 1, d -> 1)} and
   117   produces \pcode{Map(a -> 1, b -> 2, c -> 1, d -> 1)} and
    94 
   118 
    95   \begin{center}
   119   \begin{center}
    96   \pcode{occurences(List("d", "b", "d", "b", "d"))}
   120   \pcode{occurrences(List("d", "b", "d", "b", "d"))}
    97   \end{center}
   121   \end{center}
    98 
   122 
    99   produces \pcode{Map(d -> 3, b -> 2)}.\hfill[1 Mark]
   123   produces \pcode{Map(d -> 3, b -> 2)}.\hfill[1 Mark]
   100 
   124 
   101 \item[(3)] You can think of the Maps calculated under (2) as efficient
   125 \item[(3)] You can think of the Maps calculated under (2) as memory-efficient
   102   representations of sparse ``vectors''. In this subtask you need to
   126   representations of sparse ``vectors''. In this subtask you need to
   103   implement the \emph{product} of two vectors, sometimes also called
   127   implement the \emph{product} of two such vectors, sometimes also called
   104   \emph{dot product}.\footnote{\url{https://en.wikipedia.org/wiki/Dot_product}}
   128   \emph{dot product} of two vectors.\footnote{\url{https://en.wikipedia.org/wiki/Dot_product}}
   105 
   129 
   106   For this implement a function that takes two documents
   130   For this dot product, implement a function that takes two documents
   107   (\texttt{List[String]}) as arguments. The function first calculates
   131   (\texttt{List[String]}) as arguments. The function first calculates
   108   the (unique) strings in both. For each string, it multiplies the
   132   the (unique) strings in both. For each string, it multiplies the
   109   occurences in each document. If a string does not occur in one of the
   133   corresponding occurrences in each document. If a string does not
   110   documents, then the product is zero. At the end you
   134   occur in one of the documents, then the product for this string is zero. At the end
   111   sum all products. For the two documents in (2) the dot product is 7:
   135   you need to add up all products. For the two documents in (2) the dot
       
   136   product is 7, because
   112 
   137 
   113   \[
   138   \[
   114     \underbrace{1 * 0}_{"a"} \;\;+\;\;
   139     \underbrace{1 * 0}_{"a"} \;\;+\;\;
   115     \underbrace{2 * 2}_{"b"} \;\;+\;\;
   140     \underbrace{2 * 2}_{"b"} \;\;+\;\;
   116     \underbrace{1 * 0}_{"c"} \;\;+\;\;
   141     \underbrace{1 * 0}_{"c"} \;\;+\;\;
   117     \underbrace{1 * 3}_{"d"}
   142     \underbrace{1 * 3}_{"d"} \qquad = 7
   118   \]  
   143   \]  
   119   
   144   
   120   \hfill\mbox{[1 Mark]}
   145   \hfill\mbox{[1 Mark]}
   121 
   146 
   122 \item[(4)] Implement first a function that calculates the overlap
   147 \item[(4)] Implement first a function that calculates the overlap
   124 
   149 
   125   \[
   150   \[
   126   \texttt{overlap}(d_1, d_2) = \frac{d_1 \cdot d_2}{max(d_1^2, d_2^2)}  
   151   \texttt{overlap}(d_1, d_2) = \frac{d_1 \cdot d_2}{max(d_1^2, d_2^2)}  
   127   \]
   152   \]
   128 
   153 
   129   This function should return a \texttt{Double} between 0 and 1. The
   154   You can expect this function to return a \texttt{Double} between 0 and 1. The
   130   overlap between the lists in (2) is $0.5384615384615384$.
   155   overlap between the lists in (2) is $0.5384615384615384$.
   131 
   156 
   132   Second implement a function that calculates the similarity of
   157   Second, implement a function that calculates the similarity of
   133   two strings, by first extracting the strings using the function from (1)
   158   two strings, by first extracting the substrings using the clean
   134   and then calculating the overlap.
   159   function from (1)
   135   \hfill\mbox{[1 Mark]}
   160   and then calculating the overlap of the resulting documents.\\
       
   161   \mbox{}\hfill\mbox{[1 Mark]}
       
   162 \end{itemize}\bigskip
       
   163 
       
   164 
       
   165 \subsection*{Part 2 (2 Marks, file danube.scala)}
       
   166 
       
   167 You are creating Danube.co.uk which you hope will be the next big thing
       
   168 in online movie renting. You know that you can save money by
       
   169 anticipating what movies people will rent; you will pass these savings
       
   170 on to your users by offering a discount if they rent movies that
       
   171 Danube.co.uk recommends.  
       
   172 
       
   173 Your task is to generate \emph{two} movie recommendations for every
       
   174 movie a user rents. To do this, you calculate what other
       
   175 renters, who also watched this movie, suggest by giving positive ratings.
       
   176 Of course, some suggestions are more popular than others. You need to find
       
   177 the two most-frequently suggested movies. Return fewer recommendations,
       
   178 if there are fewer movies suggested.
       
   179 
       
   180 The calculations will be based on the small datasets which the research lab
       
   181 GroupLens provides for education and development purposes.
       
   182 
       
   183 \begin{center}
       
   184 \url{https://grouplens.org/datasets/movielens/}
       
   185 \end{center}
       
   186 
       
   187 \noindent
       
   188 The slightly adapted CSV-files should be downloaded in your Scala
       
   189 file from the URLs:
       
   190 
       
   191 
       
   192 \begin{center}
       
   193 \begin{tabular}{ll}  
       
   194   \url{https://nms.kcl.ac.uk/christian.urban/ratings.csv} & (940 KByte)\\
       
   195   \url{https://nms.kcl.ac.uk/christian.urban/movies.csv}  & (280 KByte)\\
       
   196 \end{tabular}
       
   197 \end{center}
       
   198 
       
   199 \noindent
       
   200 The ratings.csv file is organised as userID, 
       
   201 movieID, and rating (which is between 0 and 5, with \emph{positive} ratings
       
   202 being 4 and 5). The file movie.csv is organised as
       
   203 movieID and full movie name. Both files still contain the usual
       
   204 CSV-file header (first line). In this part you are asked
       
   205 to implement functions that process these files. If bandwidth
       
   206 is an issue for you, download the files locally, but in the submitted
       
   207 version use \texttt{Source.fromURL} instead of \texttt{Source.fromFile}.
       
   208 
       
   209 \subsection*{Tasks}
       
   210 
       
   211 \begin{itemize}
       
   212 \item[(1)] Implement the function \pcode{get_csv_url} which takes an
       
   213   URL-string as argument and requests the corresponding file. The two
       
   214   URLs of interest are \pcode{ratings_url} and \pcode{movies_url},
       
   215   which correspond to CSV-files mentioned above.  The function should
       
   216   return the CSV-file appropriately broken up into lines, and the
       
   217   first line should be dropped (that is omit the header of the CSV-file).
       
   218   The result is a list of strings (the lines in the file). In case
       
   219   the url does not produce a file, return the empty list.\\
       
   220   \mbox{}\hfill [1 Mark]
       
   221 
       
   222 \item[(2)] Implement two functions that process the (broken up)
       
   223   CSV-files from (1). The \pcode{process_ratings} function filters out all
       
   224   ratings below 4 and returns a list of (userID, movieID) pairs. The
       
   225   \pcode{process_movies} function returns a list of (movieID, title) pairs.\\
       
   226   \mbox{}\hfill [1 Mark]
       
   227 \end{itemize}  
       
   228   
       
   229 
       
   230 \subsection*{Part 3 (4 Marks, file danube.scala)}
       
   231 
       
   232 \subsection*{Tasks}
       
   233 
       
   234 \begin{itemize}
       
   235 \item[(3)] Implement a kind of grouping function that calculates a Map
       
   236   containing the userIDs and all the corresponding recommendations for
       
   237   this user (list of movieIDs). This should be implemented in a tail
       
   238   recursive fashion using a Map as accumulator. This Map is set to
       
   239   \pcode{Map()} at the beginning of the calculation. For example
       
   240 
       
   241 \begin{lstlisting}[numbers=none]
       
   242 val lst = List(("1", "a"), ("1", "b"),
       
   243                ("2", "x"), ("3", "a"),
       
   244                ("2", "y"), ("3", "c"))
       
   245 groupById(lst, Map())
       
   246 \end{lstlisting}
       
   247 
       
   248 returns the ratings map
       
   249 
       
   250 \begin{center}
       
   251   \pcode{Map(1 -> List(b, a), 2 -> List(y, x), 3 -> List(c, a))}.
       
   252 \end{center}
       
   253 
       
   254 \noindent
       
   255 In which order the elements of the list are given is unimportant.\\
       
   256 \mbox{}\hfill [1 Mark]
       
   257 
       
   258 \item[(4)] Implement a function that takes a ratings map and a movieID
       
   259   as argument.  The function calculates all suggestions containing the
       
   260   agiven movie in its recommendations. It returns a list of all these
       
   261   recommendations (each of them is a list and needs to have the given
       
   262   movie deleted, otherwise it might happen we recommend the same movie
       
   263   ``back''). For example for the Map from above and the movie
       
   264   \pcode{"y"} we obtain \pcode{List(List("x"))}, and for the movie
       
   265   \pcode{"a"} we get \pcode{List(List("b"), List("c"))}.\\
       
   266   \mbox{}\hfill [1 Mark]
       
   267 
       
   268 \item[(5)] Implement a suggestions function which takes a ratings map
       
   269   and a movieID as arguments. It calculates all the recommended movies
       
   270   sorted according to the most frequently suggested movie(s) sorted
       
   271   first. This function returns \emph{all} suggested movieIDs as a list of
       
   272   strings.\\
       
   273   \mbox{}\hfill [1 Mark]
       
   274 
       
   275 \item[(6)]  
       
   276   Implement then a recommendation function which generates a maximum
       
   277   of two most-suggested movies (as calculated above). But it returns
       
   278   the actual movie name, not the movieID. If fewer movies are recommended,
       
   279   then return fewer than two movie names.\\
       
   280   \mbox{}\hfill [1 Mark]
   136 \end{itemize}
   281 \end{itemize}
   137 
       
   138 
       
   139 
       
   140 \newpage
       
   141 You are creating Danube.org, which you hope will be the next big thing
       
   142 in online movie provider. You know that you can save money by
       
   143 anticipating what movies people will rent; you will pass these savings
       
   144 on to your users by offering a discount if they rent movies that Danube.org
       
   145 recommends.  This assignment is meant to calculate 
       
   146 
       
   147 
       
   148 To do this, you offer an incentive for people to upload their lists of
       
   149 recommended books. From their lists, you can establish suggested
       
   150 pairs. A pair of books is a suggested pair if both books appear on one
       
   151 person’s recommendation list. Of course, some suggested pairs are more
       
   152 popular than others. Also, any given book is paired with some books
       
   153 much more frequently than with others.
       
   154 
       
   155 
       
   156 
       
   157 
   282 
   158 \end{document}
   283 \end{document}
   159 
   284 
   160 %%% Local Variables: 
   285 %%% Local Variables: 
   161 %%% mode: latex
   286 %%% mode: latex