


% !TEX program = xelatex
\documentclass{article}
\usepackage{../style}
\usepackage{disclaimer}
\usepackage{../langs}

\begin{document}


%% should ask to lower case the words.

\section*{Main Part 2 (Scala, 6 Marks)}

\mbox{}\hfill\textit{``C makes it easy to shoot yourself in the foot; C++ makes it harder,}\\
\mbox{}\hfill\textit{ but when you do, it blows your whole leg off.''}\smallskip\\
\mbox{}\hfill\textit{ --- Bjarne Stroustrup (creator of the C++ language)}\bigskip\bigskip




\noindent
You are asked to implement a Scala program for recommending movies
according to a ratings list.\bigskip

\IMPORTANTNONE{}

\noindent
Also note that the running time of each part will be restricted to a
maximum of 30 seconds on my laptop.

\DISCLAIMER{}


\subsection*{Reference Implementation}

Like the C++ part, the Scala part works like this: you push your files
to GitHub and receive (sometimes after a long delay) some automated
feedback. In the end we will take a snapshot of the submitted files
and apply an automated marking script to them.\medskip

\noindent
In addition, the Scala part comes with reference
implementations in the form of \texttt{jar}-files. This allows you to run
any test cases on your own computer. For example, you can call Scala on
the command line with the option \texttt{-cp danube.jar} and then
query any function from the template file. Say you want to find out
what the function \texttt{get\_csv\_url} produces: for this you just need
to prefix it with the object name \texttt{M2}.  If you want to find out what
this function produces for the ratings URL,
you would type something like:

\begin{lstlisting}[language={},numbers=none,basicstyle=\ttfamily\small]
$ scala -cp danube.jar
scala> val ratings_url =
     | """https://nms.kcl.ac.uk/christian.urban/ratings.csv"""

scala> M2.get_csv_url(ratings_url)
val res0: List[String] = List(1,1,4 ...)
\end{lstlisting}%$

\subsection*{Hints}

\noindent
Use \texttt{.split(",").toList} for splitting
strings according to commas (similarly for the newline character \mbox{$\backslash$\texttt{n}}).
\texttt{.getOrElse(..,..)} allows you to query a Map, and gives a
default value if the key is not defined in the Map. A Map can be `updated' by
using \texttt{+}. \texttt{.contains} tests whether
an element is included in a list, and \texttt{.filter} filters out elements of a list.
\texttt{.sortBy(\_.\_2)} sorts a list of pairs according to the second
elements in the pairs; the sorting is done from smallest to largest.
\texttt{.take(n)} takes the first \texttt{n} elements of a list (fewer if the list
contains fewer than \texttt{n} elements).
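
\medskip\noindent
The following lines are a small sketch of how the library functions
from the hints behave. All values are made-up toy data and do not come
from the CSV-files; \texttt{.sortBy(-\_.\_2)} is one possible way
(not the only one) to sort from largest to smallest.

\begin{lstlisting}[numbers=none]
// Toy data only; none of these values come from the real CSV-files.
val line = "1,31,4"
val fields = line.split(",").toList     // List("1", "31", "4")

val text = "a,b\nc,d"
val lines = text.split("\n").toList     // List("a,b", "c,d")

val m = Map("1" -> List("a"))
val hit  = m.getOrElse("1", Nil)        // List("a")
val miss = m.getOrElse("2", Nil)        // Nil, the default value
val m2 = m + ("2" -> List("x"))         // a new Map; m is unchanged

val xs = List("a", "b", "c")
val hasB = xs.contains("b")             // true
val noB  = xs.filter(_ != "b")          // List("a", "c")

val pairs = List(("x", 3), ("y", 1), ("z", 2))
val asc  = pairs.sortBy(_._2)           // smallest first: y, z, x
val desc = pairs.sortBy(-_._2)          // largest first:  x, z, y
val top2 = desc.take(2)                 // List(("x", 3), ("z", 2))
val all  = desc.take(10)                // only three elements exist
\end{lstlisting}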


\newpage


\subsection*{Main Part 2 (6 Marks, file danube.scala)}

You are creating Danube.co.uk which you hope will be the next big thing
in online movie renting. You know that you can save money by
anticipating what movies people will rent; you will pass these savings
on to your users by offering a discount if they rent movies that
Danube.co.uk recommends.  

Your task is to generate \emph{two} movie recommendations for every
movie a user rents. To do this, you calculate what other
renters, who also watched this movie, suggest by giving positive ratings.
Of course, some suggestions are more popular than others. You need to find
the two most-frequently suggested movies. Return fewer recommendations,
if there are fewer movies suggested.

The calculations will be based on the small datasets which the research lab
GroupLens provides for education and development purposes.

\begin{center}
\url{https://grouplens.org/datasets/movielens/}
\end{center}

\noindent
The slightly adapted CSV-files should be downloaded in your Scala
file from the URLs:


\begin{center}
\begin{tabular}{ll}  
  \url{https://nms.kcl.ac.uk/christian.urban/ratings.csv} & (940 KByte)\\
  \url{https://nms.kcl.ac.uk/christian.urban/movies.csv}  & (280 KByte)\\
\end{tabular}
\end{center}

\noindent
The ratings.csv file is organised as userID, 
movieID, and rating (which is between 0 and 5, with \emph{positive} ratings
being 4 and 5). The file movies.csv is organised as
movieID and full movie name. Both files still contain the usual
CSV-file header (first line). In this part you are asked
to implement functions that process these files. If bandwidth
is an issue for you, download the files locally, but in the submitted
version use \texttt{Source.fromURL} instead of \texttt{Source.fromFile}.
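
\medskip\noindent
To illustrate the layout described above, here is a minimal sketch that
works on an inline string instead of the downloaded file. The sample
lines, the wording of the header and the assumption that ratings are
whole numbers are all hypothetical; the real files are much longer.

\begin{lstlisting}[numbers=none]
// Hypothetical sample in the userID, movieID, rating layout;
// the header wording and all values are assumptions.
val sample = "userId,movieId,rating\n1,31,2\n1,1029,4\n2,31,5"

// break the text into lines and drop the header (first line)
val rows = sample.split("\n").toList.drop(1)

// break every line into its comma-separated fields
val cells = rows.map(_.split(",").toList)
// List(List(1, 31, 2), List(1, 1029, 4), List(2, 31, 5))

// keep only the positive ratings (4 and 5)
val positive = cells.filter(c => c(2).toInt >= 4)
// List(List(1, 1029, 4), List(2, 31, 5))
\end{lstlisting}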

\subsection*{Tasks}

\begin{itemize}
\item[(1)] Implement the function \pcode{get_csv_url} which takes a
  URL-string as argument and requests the corresponding file. The two
  URLs of interest are \pcode{ratings_url} and \pcode{movies_url},
  which correspond to the CSV-files mentioned above.  The function should
  return the CSV-file appropriately broken up into lines, and the
  first line should be dropped (that is, omit the header of the CSV-file).
  The result is a list of strings (the lines in the file). In case
  the URL does not produce a file, return the empty list.\\
  \mbox{}\hfill [1 Mark]

\item[(2)] Implement two functions that process the (broken up)
  CSV-files from (1). The \pcode{process_ratings} function filters out all
  ratings below 4 and returns a list of (userID, movieID) pairs. The
  \pcode{process_movies} function returns a list of (movieID, title) pairs.
  Note the input to these functions will be the output of the function
  \pcode{get_csv_url}.\\
  \mbox{}\hfill [1 Mark]
%\end{itemize}  
%  
%
%\subsection*{Part 3 (4 Marks, file danube.scala)}
%
%\subsection*{Tasks}
%
%\begin{itemize}
\item[(3)] Implement a kind of grouping function that calculates a Map
  containing the userIDs and all the corresponding recommendations for
  this user (list of movieIDs). This should be implemented in a
  tail-recursive fashion using a Map as accumulator. This Map is set to
  \pcode{Map()} at the beginning of the calculation. For example

\begin{lstlisting}[numbers=none]
val lst = List(("1", "a"), ("1", "b"),
               ("2", "x"), ("3", "a"),
               ("2", "y"), ("3", "c"))
groupById(lst, Map())
\end{lstlisting}

returns the ratings map

\begin{center}
  \pcode{Map(1 -> List(b, a), 2 -> List(y, x), 3 -> List(c, a))}.
\end{center}

\noindent
In which order the elements of the list are given is unimportant.\\
\mbox{}\hfill [1 Mark]

\item[(4)] Implement a function that takes a ratings map and a movieID
  as arguments.  The function finds all recommendation lists in the map
  that contain the given movie. It returns a list of all these
  recommendations (each of them is a list and needs to have the given
  movie deleted, otherwise it might happen that we recommend the same movie
  ``back''). For example, for the Map from above and the movie
  \pcode{"y"} we obtain \pcode{List(List("x"))}, and for the movie
  \pcode{"a"} we get \pcode{List(List("b"), List("c"))}.\\
  \mbox{}\hfill [1 Mark]

\item[(5)] Implement a suggestions function which takes a ratings map
  and a movieID as arguments. It calculates all the recommended movies,
  sorted so that the most frequently suggested movie(s) come
  first. This function returns \emph{all} suggested movieIDs as a list of
  strings.\\
  \mbox{}\hfill [1 Mark]

\item[(6)]  
  Then implement a recommendation function which returns at most the
  two most-suggested movies (as calculated above). It returns
  the actual movie names, not the movieIDs. If fewer movies are recommended,
  then return fewer than two movie names.\\
  \mbox{}\hfill [1 Mark]

%\item[(7)] Calculate the recommendations for all movies according to
% what the recommendations function in (6) produces (this
% can take a few seconds). Put all recommendations into a list 
% (of strings) and count how often the strings occur in
% this list. This produces a list of string-int pairs,
% where the first component is the movie name and the second
% is the number of how many times the movie was recommended. 
% Sort all the pairs according to the number
% of times they were recommended (most recommended movie name 
% first).\\
% \mbox{}\hfill [1 Mark]
  
\end{itemize}
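
\medskip\noindent
The toy example from Tasks (3) and (4) can be replayed with a sketch
along the following lines. It is only one possible shape and not the
reference implementation: the function name \texttt{groupById} is taken
from the example above, whereas \texttt{favourites},
\texttt{suggestions} and the type alias \texttt{RMap} are hypothetical
names chosen for this illustration. The order of the elements in the
results may differ, which is fine according to Task (3).

\begin{lstlisting}[numbers=none]
import scala.annotation.tailrec

// hypothetical alias for a ratings map: userID -> list of movieIDs
type RMap = Map[String, List[String]]

// (3) tail-recursive grouping with a Map as accumulator
@tailrec
def groupById(ratings: List[(String, String)], m: RMap): RMap =
  ratings match {
    case Nil => m
    case (id, mov) :: rest =>
      groupById(rest, m + (id -> (mov :: m.getOrElse(id, Nil))))
  }

// (4) hypothetical name: all lists containing mov, with mov deleted
def favourites(m: RMap, mov: String): List[List[String]] =
  m.values.toList.filter(_.contains(mov)).map(_.filter(_ != mov))

// (5) hypothetical name: suggested movieIDs, most frequent first
def suggestions(m: RMap, mov: String): List[String] =
  favourites(m, mov).flatten
    .groupBy(identity).toList
    .map { case (id, occ) => (id, occ.length) }
    .sortBy(-_._2)
    .map(_._1)

val lst = List(("1", "a"), ("1", "b"),
               ("2", "x"), ("3", "a"),
               ("2", "y"), ("3", "c"))
val rm = groupById(lst, Map())
val forY = favourites(rm, "y")     // List(List("x"))
val forA = suggestions(rm, "a")    // contains "b" and "c"
\end{lstlisting}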

\end{document} 

%%% Local Variables: 
%%% mode: latex
%%% TeX-master: t
%%% End:
author	Christian Urban <christian.urban@kcl.ac.uk>
	Sun, 09 Jan 2022 01:06:30 +0000