cws/cw02.tex
changeset 349 682611a0fb89
parent 348 b5b6ed38c2f2
child 350 c5ad0e3f2a6d
equal deleted inserted replaced
348:b5b6ed38c2f2 349:682611a0fb89
     1 % !TEX program = xelatex
       
     2 \documentclass{article}
       
     3 \usepackage{../style}
       
     4 \usepackage{disclaimer}
       
     5 \usepackage{../langs}
       
     6 
       
     7 \begin{document}
       
     8 
       
     9 
       
    10 %% should ask to lower case the words.
       
    11 
       
    12 \section*{Part 7 (Scala)}
       
    13 
       
    14 \mbox{}\hfill\textit{``What one programmer can do in one month,}\\
       
    15 \mbox{}\hfill\textit{two programmers can do in two months.''}\smallskip\\
       
    16 \mbox{}\hfill\textit{ --- Frederick P.~Brooks (author of The Mythical Man-Month)}\bigskip\medskip
       
    17 
       
    18 \noindent
       
    19 You are asked to implement Scala programs for measuring similarity in
       
    20 texts, and for recommending movies according to a ratings list. The
       
    21 preliminary part~(4\%) is due on \cwSEVEN{} at 4pm; the core part is due
       
    22 on \cwSEVENa{} at 4pm.   Note the core part might include material you
       
    23 have not yet seen in the first two lectures. \bigskip
       
    24 
       
    25 \IMPORTANT{}
       
    26 
       
    27 \noindent
       
    28 Also note that the running time of each part will be restricted to a
       
    29 maximum of 30 seconds on my laptop.
       
    30 
       
    31 \DISCLAIMER{}
       
    32 
       
    33 
       
    34 \subsection*{Reference Implementation}
       
    35 
       
    36 Like the C++ part, the Scala part works like this: you
       
    37 push your files to GitHub and receive (after sometimes a long delay) some
       
    38 automated feedback. In the end we will take a snapshot of the submitted files and
       
    39 apply an automated marking script to them.\medskip
       
    40 
       
    41 \noindent
       
    42 In addition, the Scala part comes with reference
       
    43 implementations in form of \texttt{jar}-files. This allows you to run
       
    44 any test cases on your own computer. For example you can call Scala on
       
    45 the command line with the option \texttt{-cp docdiff.jar} and then
       
    46 query any function from the template file. Say you want to find out
       
    47 what the function \texttt{occurrences} produces: for this you just need
       
    48 to prefix it with the object name \texttt{CW7a} (and \texttt{CW7b}
       
    49 respectively for \texttt{danube.jar}).  If you want to find out what
       
    50 these functions produce for the list \texttt{List("a", "b", "b")},
       
    51 you would type something like:
       
    52 
       
    53 \begin{lstlisting}[language={},numbers=none,basicstyle=\ttfamily\small]
       
    54 $ scala -cp docdiff.jar
       
    55   
       
    56 scala> CW7a.occurrences(List("a", "b", "b"))
       
    57 ...
       
    58 \end{lstlisting}%$
       
    59 
       
    60 \subsection*{Hints}
       
    61 
       
    62 \noindent
       
    63 \textbf{For Preliminary Part:} useful operations involving regular
       
    64 expressions:
       
    65 \[\texttt{reg.findAllIn(s).toList}\]
       
    66 \noindent finds all
       
    67 substrings in \texttt{s} according to a regular regular expression
       
    68 \texttt{reg}; useful list operations: \texttt{.distinct}
       
    69 removing duplicates from a list, \texttt{.count} counts the number of
       
    70 elements in a list that satisfy some condition, \texttt{.toMap}
       
    71 transfers a list of pairs into a Map, \texttt{.sum} adds up a list of
       
    72 integers, \texttt{.max} calculates the maximum of a list.\bigskip
       
    73 
       
    74 \noindent
       
    75 \textbf{For Core Part:} use \texttt{.split(",").toList} for splitting
       
    76 strings according to commas (similarly $\backslash$\texttt{n}),
       
    77 \texttt{.getOrElse(..,..)} allows to query a Map, but also gives a
       
    78 default value if the Map is not defined, a Map can be `updated' by
       
    79 using \texttt{+}, \texttt{.contains} and \texttt{.filter} can test whether
       
    80 an element is included in a list, and respectively filter out elements in a list,
       
    81 \texttt{.sortBy(\_.\_2)} sorts a list of pairs according to the second
       
    82 elements in the pairs---the sorting is done from smallest to highest,
       
    83 \texttt{.take(n)} for taking some elements in a list (takes fewer if the list
       
    84 contains less than \texttt{n} elements).
       
    85 
       
    86 
       
    87 \newpage
       
    88 \subsection*{Preliminary Part (4 Marks, file docdiff.scala)}
       
    89 
       
    90 It seems source code plagiarism---stealing and submitting someone
       
    91 else's code---is a serious problem at other
       
    92 universities.\footnote{Surely, King's students, after all their
       
    93   instructions and warnings, would never commit such an offence. Yes?}
       
    94 Detecting such plagiarism is time-consuming and disheartening for
       
    95 lecturers at those universities. To aid these poor souls, let's
       
    96 implement in this part a program that determines the similarity
       
    97 between two documents (be they source code or texts in English). A
       
    98 document will be represented as a list of strings.
       
    99 
       
   100 
       
   101 \subsection*{Tasks}
       
   102 
       
   103 \begin{itemize}
       
   104 \item[(1)] Implement a function that `cleans' a string by finding all
       
   105   (proper) words in this string. For this use the regular expression
       
   106   \texttt{\textbackslash{}w+} for recognising words and the library function
       
   107   \texttt{findAllIn}. The function should return a document (a list of
       
   108   strings).\\
       
   109   \mbox{}\hfill [1 Mark]
       
   110 
       
   111 \item[(2)] In order to compute the overlap between two documents, we
       
   112   associate each document with a \texttt{Map}. This Map represents the
       
   113   strings in a document and how many times these strings occur in the
       
   114   document. A simple (though slightly inefficient) method for counting
       
   115   the number of string-occurrences in a document is as follows: remove
       
   116   all duplicates from the document; for each of these (unique)
       
   117   strings, count how many times they occur in the original document.
       
   118   Return a Map associating strings with occurrences. For example
       
   119 
       
   120   \begin{center}
       
   121   \pcode{occurrences(List("a", "b", "b", "c", "d"))}
       
   122   \end{center}
       
   123 
       
   124   produces \pcode{Map(a -> 1, b -> 2, c -> 1, d -> 1)} and
       
   125 
       
   126   \begin{center}
       
   127   \pcode{occurrences(List("d", "b", "d", "b", "d"))}
       
   128   \end{center}
       
   129 
       
   130   produces \pcode{Map(d -> 3, b -> 2)}.\hfill[1 Mark]
       
   131 
       
   132 \item[(3)] You can think of the Maps calculated under (2) as memory-efficient
       
   133   representations of sparse ``vectors''. In this subtask you need to
       
   134   implement the \emph{product} of two such vectors, sometimes also called
       
   135   \emph{dot product} of two vectors.\footnote{\url{https://en.wikipedia.org/wiki/Dot_product}}
       
   136 
       
   137   For this dot product, implement a function that takes two documents
       
   138   (\texttt{List[String]}) as arguments. The function first calculates
       
   139   the (unique) strings in both. For each string, it multiplies the
       
   140   corresponding occurrences in each document. If a string does not
       
   141   occur in one of the documents, then the product for this string is zero. At the end
       
   142   you need to add up all products. For the two documents in (2) the dot
       
   143   product is 7, because
       
   144 
       
   145   \[
       
   146     \underbrace{1 * 0}_{"a"} \;\;+\;\;
       
   147     \underbrace{2 * 2}_{"b"} \;\;+\;\;
       
   148     \underbrace{1 * 0}_{"c"} \;\;+\;\;
       
   149     \underbrace{1 * 3}_{"d"} \qquad = 7
       
   150   \]  
       
   151   
       
   152   \hfill\mbox{[1 Mark]}
       
   153 
       
   154 \item[(4)] Implement first a function that calculates the overlap
       
   155   between two documents, say $d_1$ and $d_2$, according to the formula
       
   156 
       
   157   \[
       
   158   \texttt{overlap}(d_1, d_2) = \frac{d_1 \cdot d_2}{max(d_1^2, d_2^2)}  
       
   159   \]
       
   160 
       
   161   where $d_1^2$ means $d_1 \cdot d_1$ and so on.
       
   162   You can expect this function to return a \texttt{Double} between 0 and 1. The
       
   163   overlap between the lists in (2) is $0.5384615384615384$.
       
   164 
       
   165   Second, implement a function that calculates the similarity of
       
   166   two strings, by first extracting the substrings using the clean
       
   167   function from (1)
       
   168   and then calculating the overlap of the resulting documents.\\
       
   169   \mbox{}\hfill\mbox{[1 Mark]}
       
   170 \end{itemize}\bigskip
       
   171 
       
   172 
       
   173 \subsection*{Core Part (6 Marks, file danube.scala)}
       
   174 
       
   175 You are creating Danube.co.uk which you hope will be the next big thing
       
   176 in online movie renting. You know that you can save money by
       
   177 anticipating what movies people will rent; you will pass these savings
       
   178 on to your users by offering a discount if they rent movies that
       
   179 Danube.co.uk recommends.  
       
   180 
       
   181 Your task is to generate \emph{two} movie recommendations for every
       
   182 movie a user rents. To do this, you calculate what other
       
   183 renters, who also watched this movie, suggest by giving positive ratings.
       
   184 Of course, some suggestions are more popular than others. You need to find
       
   185 the two most-frequently suggested movies. Return fewer recommendations,
       
   186 if there are fewer movies suggested.
       
   187 
       
   188 The calculations will be based on the small datasets which the research lab
       
   189 GroupLens provides for education and development purposes.
       
   190 
       
   191 \begin{center}
       
   192 \url{https://grouplens.org/datasets/movielens/}
       
   193 \end{center}
       
   194 
       
   195 \noindent
       
   196 The slightly adapted CSV-files should be downloaded in your Scala
       
   197 file from the URLs:
       
   198 
       
   199 
       
   200 \begin{center}
       
   201 \begin{tabular}{ll}  
       
   202   \url{https://nms.kcl.ac.uk/christian.urban/ratings.csv} & (940 KByte)\\
       
   203   \url{https://nms.kcl.ac.uk/christian.urban/movies.csv}  & (280 KByte)\\
       
   204 \end{tabular}
       
   205 \end{center}
       
   206 
       
   207 \noindent
       
   208 The ratings.csv file is organised as userID, 
       
   209 movieID, and rating (which is between 0 and 5, with \emph{positive} ratings
       
   210 being 4 and 5). The file movie.csv is organised as
       
   211 movieID and full movie name. Both files still contain the usual
       
   212 CSV-file header (first line). In this part you are asked
       
   213 to implement functions that process these files. If bandwidth
       
   214 is an issue for you, download the files locally, but in the submitted
       
   215 version use \texttt{Source.fromURL} instead of \texttt{Source.fromFile}.
       
   216 
       
   217 \subsection*{Tasks}
       
   218 
       
   219 \begin{itemize}
       
   220 \item[(1)] Implement the function \pcode{get_csv_url} which takes an
       
   221   URL-string as argument and requests the corresponding file. The two
       
   222   URLs of interest are \pcode{ratings_url} and \pcode{movies_url},
       
   223   which correspond to CSV-files mentioned above.  The function should
       
   224   return the CSV-file appropriately broken up into lines, and the
       
   225   first line should be dropped (that is omit the header of the CSV-file).
       
   226   The result is a list of strings (the lines in the file). In case
       
   227   the url does not produce a file, return the empty list.\\
       
   228   \mbox{}\hfill [1 Mark]
       
   229 
       
   230 \item[(2)] Implement two functions that process the (broken up)
       
   231   CSV-files from (1). The \pcode{process_ratings} function filters out all
       
   232   ratings below 4 and returns a list of (userID, movieID) pairs. The
       
   233   \pcode{process_movies} function returns a list of (movieID, title) pairs.
       
   234   Note the input to these functions will be the output of the function
       
   235   \pcode{get_csv_url}.\\
       
   236   \mbox{}\hfill [1 Mark]
       
   237 %\end{itemize}  
       
   238 %  
       
   239 %
       
   240 %\subsection*{Part 3 (4 Marks, file danube.scala)}
       
   241 %
       
   242 %\subsection*{Tasks}
       
   243 %
       
   244 %\begin{itemize}
       
   245 \item[(3)] Implement a kind of grouping function that calculates a Map
       
   246   containing the userIDs and all the corresponding recommendations for
       
   247   this user (list of movieIDs). This should be implemented in a
       
   248   tail-recursive fashion using a Map as accumulator. This Map is set to
       
   249   \pcode{Map()} at the beginning of the calculation. For example
       
   250 
       
   251 \begin{lstlisting}[numbers=none]
       
   252 val lst = List(("1", "a"), ("1", "b"),
       
   253                ("2", "x"), ("3", "a"),
       
   254                ("2", "y"), ("3", "c"))
       
   255 groupById(lst, Map())
       
   256 \end{lstlisting}
       
   257 
       
   258 returns the ratings map
       
   259 
       
   260 \begin{center}
       
   261   \pcode{Map(1 -> List(b, a), 2 -> List(y, x), 3 -> List(c, a))}.
       
   262 \end{center}
       
   263 
       
   264 \noindent
       
   265 In which order the elements of the list are given is unimportant.\\
       
   266 \mbox{}\hfill [1 Mark]
       
   267 
       
   268 \item[(4)] Implement a function that takes a ratings map and a movieID
       
   269   as arguments.  The function calculates all suggestions containing the
       
   270   given movie in its recommendations. It returns a list of all these
       
   271   recommendations (each of them is a list and needs to have the given
       
   272   movie deleted, otherwise it might happen we recommend the same movie
       
   273   ``back''). For example for the Map from above and the movie
       
   274   \pcode{"y"} we obtain \pcode{List(List("x"))}, and for the movie
       
   275   \pcode{"a"} we get \pcode{List(List("b"), List("c"))}.\\
       
   276   \mbox{}\hfill [1 Mark]
       
   277 
       
   278 \item[(5)] Implement a suggestions function which takes a ratings map
       
   279   and a movieID as arguments. It calculates all the recommended movies
       
   280   sorted according to the most frequently suggested movie(s) sorted
       
   281   first. This function returns \emph{all} suggested movieIDs as a list of
       
   282   strings.\\
       
   283   \mbox{}\hfill [1 Mark]
       
   284 
       
   285 \item[(6)]  
       
   286   Implement then a recommendation function which generates a maximum
       
   287   of two most-suggested movies (as calculated above). But it returns
       
   288   the actual movie name, not the movieID. If fewer movies are recommended,
       
   289   then return fewer than two movie names.\\
       
   290   \mbox{}\hfill [1 Mark]
       
   291 \end{itemize}
       
   292 
       
   293 \end{document} 
       
   294 
       
   295 %%% Local Variables: 
       
   296 %%% mode: latex
       
   297 %%% TeX-master: t
       
   298 %%% End: