cws/cw02.tex
author Christian Urban <urbanc@in.tum.de>
Thu, 22 Nov 2018 23:00:57 +0000
changeset 216 8c868feb917b
parent 210 63a1376cbebd
child 251 3dd550b9b0e3
permissions -rw-r--r--
updated

\documentclass{article}
\usepackage{../style}
\usepackage{disclaimer}
\usepackage{../langs}

\begin{document}


\section*{Coursework 7 (Scala)}

This coursework is worth 10\%. The first and second part are due
on 22 November at 11pm; the third, more advanced part, is due on 21
December at 11pm. You are asked to implement Scala programs for
measuring similarity in texts, and for recommending movies
according to a ratings list.  Note the second part might include
material you have not yet seen in the first two lectures. \bigskip

\IMPORTANT{}

\noindent
Also note that the running time of each part will be restricted to a
maximum of 30 seconds on my laptop.

\DISCLAIMER{}


\subsection*{Reference Implementation}

Like the C++ assignments, the Scala assignments will work like this: you
push your files to GitHub and receive (after sometimes a long delay) some
automated feedback. In the end we take a snapshot of the submitted files and
apply an automated marking script to them.

In addition, the Scala assignments come with a reference
implementation in form of a \texttt{jar}-file. This allows you to run
any test cases on your own computer. For example you can call Scala on
the command line with the option \texttt{-cp docdiff.jar} and then
query any function from the template file. Say you want to find out
what the function \texttt{occurrences} produces: for this you just need
to prefix it with the object name \texttt{CW7a} (and \texttt{CW7b}
respectively for \texttt{danube.jar}).  If you want to find out what
these functions produce for the list \texttt{List("a", "b", "b")},
you would type something like:

\begin{lstlisting}[language={},numbers=none,basicstyle=\ttfamily\small]
$ scala -cp docdiff.jar
  
scala> CW7a.occurrences(List("a", "b", "b"))
...
\end{lstlisting}%$

\subsection*{Hints}

\noindent
\textbf{For Part 1:} useful operations involving regular
expressions:
\[\texttt{reg.findAllIn(s).toList}\]
\noindent finds all
substrings in \texttt{s} according to a regular regular expression
\texttt{reg}; useful list operations: \texttt{.distinct}
removing duplicates from a list, \texttt{.count} counts the number of
elements in a list that satisfy some condition, \texttt{.toMap}
transfers a list of pairs into a Map, \texttt{.sum} adds up a list of
integers, \texttt{.max} calculates the maximum of a list.\bigskip

\noindent
\textbf{For Part 2 + 3:} use \texttt{.split(",").toList} for splitting
strings according to commas (similarly $\backslash$\texttt{n}),
\texttt{.getOrElse(..,..)} allows to querry a Map, but also gives a
default value if the Map is not defined, a Map can be `updated' by
using \texttt{+}, \texttt{.contains} and \texttt{.filter} can test whether
an element is included in a list, and respectively filter out elements in a list,
\texttt{.sortBy(\_.\_2)} sorts a list of pairs according to the second
elements in the pairs---the sorting is done from smallest to highest,
\texttt{.take(n)} for taking some elements in a list (takes fewer if the list
contains less than \texttt{n} elements).


\newpage
\subsection*{Part 1 (4 Marks, file docdiff.scala)}

It seems source code plagiarism---stealing and submitting someone
else's code---is a serious problem at other
universities.\footnote{Surely, King's students, after all their
  instructions and warnings, would never commit such an offence. Yes?}
Detecting such plagiarism is time-consuming and disheartening for
lecturers at those universities. To aid these poor souls, let's
implement in this part a program that determines the similarity
between two documents (be they source code or texts in English). A
document will be represented as a list of strings.


\subsection*{Tasks}

\begin{itemize}
\item[(1)] Implement a function that `cleans' a string by finding all
  (proper) words in this string. For this use the regular expression
  \texttt{$\backslash$w+} for recognising word characters and the
  library function \texttt{findAllIn}. The function should return a
  document (a list of
  strings).\\
  \mbox{}\hfill [1 Mark]

\item[(2)] In order to compute the overlap between two documents, we
  associate each document with a \texttt{Map}. This Map represents the
  strings in a document and how many times these strings occur in the
  document. A simple (though slightly inefficient) method for counting
  the number of string-occurrences in a document is as follows: remove
  all duplicates from the document; for each of these (unique)
  strings, count how many times they occur in the original document.
  Return a Map associating strings with occurrences. For example

  \begin{center}
  \pcode{occurrences(List("a", "b", "b", "c", "d"))}
  \end{center}

  produces \pcode{Map(a -> 1, b -> 2, c -> 1, d -> 1)} and

  \begin{center}
  \pcode{occurrences(List("d", "b", "d", "b", "d"))}
  \end{center}

  produces \pcode{Map(d -> 3, b -> 2)}.\hfill[1 Mark]

\item[(3)] You can think of the Maps calculated under (2) as memory-efficient
  representations of sparse ``vectors''. In this subtask you need to
  implement the \emph{product} of two such vectors, sometimes also called
  \emph{dot product} of two vectors.\footnote{\url{https://en.wikipedia.org/wiki/Dot_product}}

  For this dot product, implement a function that takes two documents
  (\texttt{List[String]}) as arguments. The function first calculates
  the (unique) strings in both. For each string, it multiplies the
  corresponding occurrences in each document. If a string does not
  occur in one of the documents, then the product for this string is zero. At the end
  you need to add up all products. For the two documents in (2) the dot
  product is 7, because

  \[
    \underbrace{1 * 0}_{"a"} \;\;+\;\;
    \underbrace{2 * 2}_{"b"} \;\;+\;\;
    \underbrace{1 * 0}_{"c"} \;\;+\;\;
    \underbrace{1 * 3}_{"d"} \qquad = 7
  \]  
  
  \hfill\mbox{[1 Mark]}

\item[(4)] Implement first a function that calculates the overlap
  between two documents, say $d_1$ and $d_2$, according to the formula

  \[
  \texttt{overlap}(d_1, d_2) = \frac{d_1 \cdot d_2}{max(d_1^2, d_2^2)}  
  \]

  You can expect this function to return a \texttt{Double} between 0 and 1. The
  overlap between the lists in (2) is $0.5384615384615384$.

  Second, implement a function that calculates the similarity of
  two strings, by first extracting the substrings using the clean
  function from (1)
  and then calculating the overlap of the resulting documents.\\
  \mbox{}\hfill\mbox{[1 Mark]}
\end{itemize}\bigskip


\subsection*{Part 2 (2 Marks, file danube.scala)}

You are creating Danube.co.uk which you hope will be the next big thing
in online movie renting. You know that you can save money by
anticipating what movies people will rent; you will pass these savings
on to your users by offering a discount if they rent movies that
Danube.co.uk recommends.  

Your task is to generate \emph{two} movie recommendations for every
movie a user rents. To do this, you calculate what other
renters, who also watched this movie, suggest by giving positive ratings.
Of course, some suggestions are more popular than others. You need to find
the two most-frequently suggested movies. Return fewer recommendations,
if there are fewer movies suggested.

The calculations will be based on the small datasets which the research lab
GroupLens provides for education and development purposes.

\begin{center}
\url{https://grouplens.org/datasets/movielens/}
\end{center}

\noindent
The slightly adapted CSV-files should be downloaded in your Scala
file from the URLs:


\begin{center}
\begin{tabular}{ll}  
  \url{https://nms.kcl.ac.uk/christian.urban/ratings.csv} & (940 KByte)\\
  \url{https://nms.kcl.ac.uk/christian.urban/movies.csv}  & (280 KByte)\\
\end{tabular}
\end{center}

\noindent
The ratings.csv file is organised as userID, 
movieID, and rating (which is between 0 and 5, with \emph{positive} ratings
being 4 and 5). The file movie.csv is organised as
movieID and full movie name. Both files still contain the usual
CSV-file header (first line). In this part you are asked
to implement functions that process these files. If bandwidth
is an issue for you, download the files locally, but in the submitted
version use \texttt{Source.fromURL} instead of \texttt{Source.fromFile}.

\subsection*{Tasks}

\begin{itemize}
\item[(1)] Implement the function \pcode{get_csv_url} which takes an
  URL-string as argument and requests the corresponding file. The two
  URLs of interest are \pcode{ratings_url} and \pcode{movies_url},
  which correspond to CSV-files mentioned above.  The function should
  return the CSV-file appropriately broken up into lines, and the
  first line should be dropped (that is omit the header of the CSV-file).
  The result is a list of strings (the lines in the file). In case
  the url does not produce a file, return the empty list.\\
  \mbox{}\hfill [1 Mark]

\item[(2)] Implement two functions that process the (broken up)
  CSV-files from (1). The \pcode{process_ratings} function filters out all
  ratings below 4 and returns a list of (userID, movieID) pairs. The
  \pcode{process_movies} function returns a list of (movieID, title) pairs.\\
  \mbox{}\hfill [1 Mark]
\end{itemize}  
  

\subsection*{Part 3 (4 Marks, file danube.scala)}

\subsection*{Tasks}

\begin{itemize}
\item[(3)] Implement a kind of grouping function that calculates a Map
  containing the userIDs and all the corresponding recommendations for
  this user (list of movieIDs). This should be implemented in a tail
  recursive fashion using a Map as accumulator. This Map is set to
  \pcode{Map()} at the beginning of the calculation. For example

\begin{lstlisting}[numbers=none]
val lst = List(("1", "a"), ("1", "b"),
               ("2", "x"), ("3", "a"),
               ("2", "y"), ("3", "c"))
groupById(lst, Map())
\end{lstlisting}

returns the ratings map

\begin{center}
  \pcode{Map(1 -> List(b, a), 2 -> List(y, x), 3 -> List(c, a))}.
\end{center}

\noindent
In which order the elements of the list are given is unimportant.\\
\mbox{}\hfill [1 Mark]

\item[(4)] Implement a function that takes a ratings map and a movieID
  as arguments.  The function calculates all suggestions containing the
  given movie in its recommendations. It returns a list of all these
  recommendations (each of them is a list and needs to have the given
  movie deleted, otherwise it might happen we recommend the same movie
  ``back''). For example for the Map from above and the movie
  \pcode{"y"} we obtain \pcode{List(List("x"))}, and for the movie
  \pcode{"a"} we get \pcode{List(List("b"), List("c"))}.\\
  \mbox{}\hfill [1 Mark]

\item[(5)] Implement a suggestions function which takes a ratings map
  and a movieID as arguments. It calculates all the recommended movies
  sorted according to the most frequently suggested movie(s) sorted
  first. This function returns \emph{all} suggested movieIDs as a list of
  strings.\\
  \mbox{}\hfill [1 Mark]

\item[(6)]  
  Implement then a recommendation function which generates a maximum
  of two most-suggested movies (as calculated above). But it returns
  the actual movie name, not the movieID. If fewer movies are recommended,
  then return fewer than two movie names.\\
  \mbox{}\hfill [1 Mark]
\end{itemize}

\end{document}

%%% Local Variables: 
%%% mode: latex
%%% TeX-master: t
%%% End: