cws/main_cw02.tex
changeset 428 cdfa6a293453
parent 426 b51467741af2
child 445 b73e7ce91c10
equal deleted inserted replaced
427:6e93040e3378 428:cdfa6a293453
     2 % !TEX program = xelatex
     2 % !TEX program = xelatex
     3 \documentclass{article}
     3 \documentclass{article}
     4 \usepackage{../styles/style}
     4 \usepackage{../styles/style}
     5 \usepackage{disclaimer}
     5 \usepackage{disclaimer}
     6 \usepackage{../styles/langs}
     6 \usepackage{../styles/langs}
       
     7 \usepackage{graphicx}
     7 
     8 
     8 \begin{document}
     9 \begin{document}
     9 
    10 
    10 
    11 
    11 %% should ask to lower case the words.
    12 %% should ask to lower case the words.
    12 
    13 
    13 \section*{Main Part 2 (Scala, 6 Marks)}
    14 \section*{Evil Wordle Game (Scala, 7 Marks)}
    14 
    15 
    15 \mbox{}\hfill\textit{``C makes it easy to shoot yourself in the foot; C++ makes it harder,}\\
    16 \mbox{}\hfill\textit{``C makes it easy to shoot yourself in the foot; C++ makes it harder,}\\
    16 \mbox{}\hfill\textit{ but when you do, it blows your whole leg off.''}\smallskip\\
    17 \mbox{}\hfill\textit{ but when you do, it blows your whole leg off.''}\smallskip\\
    17 \mbox{}\hfill\textit{ --- Bjarne Stroustrup (creator of the C++ language)}\bigskip\bigskip
    18 \mbox{}\hfill\textit{ --- Bjarne Stroustrup (creator of the C++ language)}\bigskip\bigskip
    18 
    19 
    19 
    20 
    20 
    21 
    21 
    22 
    22 \noindent
    23 \noindent
    23 You are asked to implement a Scala program for recommending movies
    24 You are asked to implement a Scala program for making the popular Wordle game as difficult
    24 according to a ratings list.\bigskip
    25 as possible.\bigskip
    25 
    26 
    26 \IMPORTANTNONE{}
    27 \IMPORTANTNONE{}
    27 
    28 
    28 \noindent
    29 \noindent
    29 Also note that the running time of each part will be restricted to a
    30 Also note that the running time of each part will be restricted to a
    30 maximum of 30 seconds on my laptop.
    31 maximum of 30 seconds on my laptop.
    31 
    32 
    32 \DISCLAIMER{}
    33 \DISCLAIMER{}
    33 
       
    34 
    34 
    35 \subsection*{Reference Implementation}
    35 \subsection*{Reference Implementation}
    36 
    36 
    37 Like the C++ part, the Scala part works like this: you push your files
    37 Like the C++ part, the Scala part works like this: you push your files
    38 to GitHub and receive (after sometimes a long delay) some automated
    38 to GitHub and receive (after sometimes a long delay) some automated
    49 to prefix it with the object name \texttt{M2}.  If you want to find out what
    49 to prefix it with the object name \texttt{M2}.  If you want to find out what
    50 these functions produce for the list \texttt{List("a", "b", "b")},
    50 these functions produce for the list \texttt{List("a", "b", "b")},
    51 you would type something like:
    51 you would type something like:
    52 
    52 
    53 \begin{lstlisting}[language={},numbers=none,basicstyle=\ttfamily\small]
    53 \begin{lstlisting}[language={},numbers=none,basicstyle=\ttfamily\small]
    54 $ scala -cp danube.jar
    54 $ scala -cp wordle.jar
    55 scala> val ratings_url =
    55 scala> val secretsURL =
    56      | """https://nms.kcl.ac.uk/christian.urban/ratings.csv"""
    56      | """https://nms.kcl.ac.uk/christian.urban/wordle.txt"""
    57 
    57 
    58 scala> M2.get_csv_url(ratings_url)
    58 scala> M2.get_wordle_list(secretsURL)
    59 val res0: List[String] = List(1,1,4 ...)
    59 val res0: List[String] = List(aahed, aalii, ...)
    60 \end{lstlisting}%$
    60 \end{lstlisting}%$
    61 
    61 
    62 \subsection*{Hints}
    62 \subsection*{Hints}
    63 
    63 
    64 \noindent
    64 \noindent
    65 Use \texttt{.split(",").toList} for splitting
    65 Useful data functions: \texttt{Source.fromURL},
    66 strings according to commas (similarly for the newline character \mbox{$\backslash$\texttt{n}}),
    66 \texttt{Source.fromFile} for obtaining a webpage and reading a file,
    67 \texttt{.getOrElse(..,..)} allows to query a Map, but also gives a
    67 \texttt{.getOrElse(..,..)} allows to query a Map, but also gives a
    68 default value if the Map is not defined, a Map can be `updated' by
    68 default value if the Map is not defined, a Map can be `updated' by
    69 using \texttt{+}, \texttt{.contains} and \texttt{.filter} can test whether
    69 using \texttt{+}, \texttt{.contains} and \texttt{.filter} can test whether
    70 an element is included in a list, and respectively filter out elements in a list,
    70 an element is included in a list, and respectively filter out elements in a list,
    71 \texttt{.sortBy(\_.\_2)} sorts a list of pairs according to the second
    71 \texttt{.sortBy(\_.\_2)} sorts a list of pairs according to the second
    72 elements in the pairs---the sorting is done from smallest to highest,
    72 elements in the pairs---the sorting is done from smallest to highest,
    73 \texttt{.take(n)} for taking some elements in a list (takes fewer if the list
    73 \texttt{.groupBy} orders lists according to same elements
    74 contains less than \texttt{n} elements).
    74 .
    75 
    75 
    76 
    76 
    77 \newpage
    77 \newpage
    78 
    78 
    79 
    79 
    80 \subsection*{Main Part 2 (6 Marks, file danube.scala)}
    80 \subsection*{Main Part 2 (6 Marks, file wordle.scala)}
    81 
    81 
    82 You are creating Danube.co.uk which you hope will be the next big thing
    82 You probably know the game of Wordle\footnote{\url{https://en.wikipedia.org/wiki/Wordle}}
    83 in online movie renting. You know that you can save money by
    83 where you are supposed to guess a five-letter word. The feedback for guesses can help
    84 anticipating what movies people will rent; you will pass these savings
    84 with the next guess (green letters are correct, orange letters are present, but in the
    85 on to your users by offering a discount if they rent movies that
    85 wrong place). For example:
    86 Danube.co.uk recommends.  
       
    87 
       
    88 Your task is to generate \emph{two} movie recommendations for every
       
    89 movie a user rents. To do this, you calculate what other
       
    90 renters, who also watched this movie, suggest by giving positive ratings.
       
    91 Of course, some suggestions are more popular than others. You need to find
       
    92 the two most-frequently suggested movies. Return fewer recommendations,
       
    93 if there are fewer movies suggested.
       
    94 
       
    95 The calculations will be based on the small datasets which the research lab
       
    96 GroupLens provides for education and development purposes.
       
    97 
    86 
    98 \begin{center}
    87 \begin{center}
    99 \url{https://grouplens.org/datasets/movielens/}
    88 \includegraphics[scale=0.2]{../pics/w.jpeg}
   100 \end{center}
    89 \end{center}  
   101 
    90 
   102 \noindent
    91 \noindent
   103 The slightly adapted CSV-files should be downloaded in your Scala
    92 The idea of the program to be implemented here is to make the Wordle game as evil as possible
   104 file from the URLs:
    93 by finding words that are the most difficult to guess. A word list of five-letter words is available
   105 
    94 from 
   106 
    95 
   107 \begin{center}
    96 \begin{center}
   108 \begin{tabular}{ll}  
    97 \begin{tabular}{ll}  
   109   \url{https://nms.kcl.ac.uk/christian.urban/ratings.csv} & (940 KByte)\\
    98   \url{https://nms.kcl.ac.uk/christian.urban/wordle.txt} & (78 KByte)\\
   110   \url{https://nms.kcl.ac.uk/christian.urban/movies.csv}  & (280 KByte)\\
       
   111 \end{tabular}
    99 \end{tabular}
   112 \end{center}
   100 \end{center}
   113 
   101 
   114 \noindent
   102 \noindent
   115 The ratings.csv file is organised as userID, 
   103 In your program you need to download this list and implement some
   116 movieID, and rating (which is between 0 and 5, with \emph{positive} ratings
   104 functions that in the end select the most difficult words (given an
   117 being 4 and 5). The file movie.csv is organised as
   105 input from the user).  If bandwidth is an issue for you, download the
   118 movieID and full movie name. Both files still contain the usual
   106 file locally, but in the submitted version use \texttt{Source.fromURL}
   119 CSV-file header (first line). In this part you are asked
   107 instead of \texttt{Source.fromFile}.
   120 to implement functions that process these files. If bandwidth
       
   121 is an issue for you, download the files locally, but in the submitted
       
   122 version use \texttt{Source.fromURL} instead of \texttt{Source.fromFile}.
       
   123 
   108 
   124 \subsection*{Tasks}
   109 \subsection*{Tasks}
   125 
   110 
   126 \begin{itemize}
   111 \begin{itemize}
   127 \item[(1)] Implement the function \pcode{get_csv_url} which takes an
   112 \item[(1)] Implement the function \pcode{get_wordle_list} which takes an
   128   URL-string as argument and requests the corresponding file. The two
   113   URL-string as argument and requests the corresponding file. The function should
   129   URLs of interest are \pcode{ratings_url} and \pcode{movies_url},
   114   return the word list appropriately broken up into lines.
   130   which correspond to CSV-files mentioned above.  The function should
   115   The result should be a list of strings (the lines in the file). In case
   131   return the CSV-file appropriately broken up into lines, and the
   116   the url does not produce a file, return the empty list.
   132   first line should be dropped (that is omit the header of the CSV-file).
   117   \mbox{}\hfill [0.5 Marks]
   133   The result is a list of strings (the lines in the file). In case
   118 
   134   the url does not produce a file, return the empty list.\\
   119 \item[(2)] Implement a polymorphic function \pcode{removeN}, which removes $n$ occurrences of an
       
   120   element from a list (if this element is less than $n$ times present, then remove all occurrences).
       
   121   For example
       
   122 
       
   123 \begin{lstlisting}[numbers=none]
       
   124 removeN(List(1,2,3,2,1), 3, 2)  => List(1, 2, 2, 1)
       
   125 removeN(List(1,2,3,2,1), 2, 1)  => List(1, 3, 2, 1)
       
   126 removeN(List(1,2,3,2,1), 2, 2)  => List(1, 3, 1)
       
   127 removeN(List(1,2,3,2,1), 1, 1)  => List(2, 3, 2, 1)
       
   128 removeN(List(1,2,3,2,1), 1, 3)  => List(2, 3, 2)
       
   129 removeN(List(1,2,3,2,1), 0, 2)  => List(1, 2, 3, 2, 1)
       
   130 \end{lstlisting}
       
   131 
       
   132 Make sure you only remove at most $n$ occurrences of the element from the list.
       
   133 This function should work for lists of integers but also lists of chars, strings etc.\\
       
   134   \mbox{}\hfill [0.5 Marks]
       
   135 
       
   136 \item[(3)] Implement a function \pcode{score} that calculates the
       
   137   feedback for a word against a secret word using the rules of the
       
   138   Wordle game. The output of \pcode{score} should be a list of 5
       
   139   elements of type \pcode{Tip} representing three outcomes: a letter
       
   140   in the correct position, a letter that is present, but not in the
       
   141   correct position and a letter that is absent. For example given the
       
   142   secret word "chess" the score for the word "caves" is
       
   143 
       
   144 \begin{lstlisting}[numbers=none]
       
   145 List(Correct, Absent, Absent, Present, Correct)
       
   146 \end{lstlisting}
       
   147 
       
   148   You have to be careful with multiple occurrences of letters. For example
       
   149   the secret "chess" with the guess "swiss" should produce
       
   150 
       
   151 \begin{lstlisting}[numbers=none]
       
   152 List(Absent, Absent, Absent, Correct, Correct)
       
   153 \end{lstlisting}
       
   154 
       
   155 even though the first 's' in "swiss" is present in the secret word, the 's' are already
       
   156 `used up' by the two letters that are correct. To implement this you need to
       
   157 implement first a function \pcode{pool} which calculates all the letters in
       
   158 a secret that are not correct in a word. For example
       
   159 
       
   160 \begin{lstlisting}[numbers=none]
       
   161   pool("chess", "caves")  => List(h, e, s)
       
   162   pool("chess", "swiss")  => List(c, h, e)
       
   163 \end{lstlisting}
       
   164 
       
   165   Now the helper function \pcode{aux} can analyse the arguments secret and word recursively letter-by-letter and
       
   166   decide: if the letters are the same, then return \pcode{Correct} for the corresponding position.
       
   167   If they are not the same, but the letter is in the pool, then return \pcode{Present} and also remove
       
   168   this letter from the pool in the next recursive call of \pcode{aux}. Otherwise return \pcode{Absent} for the
       
   169   corresponding position. The function \pcode{score} is a wrapper for the function \pcode{aux}
       
   170   calling \pcode{aux} with the appropriate arguments (recall what is calculated with \pcode{pool}).\mbox{}\hfill [2 Marks]
       
   171 
       
   172 \item[(4)] Implement a function \pcode{eval} that gives an integer value to each of the
       
   173   \pcode{Tip}s such that
       
   174 
       
   175   \begin{center}
       
   176   \begin{tabular}{lcl}  
       
   177     \textit{eval (Correct)} & $\dn$ & $10$\\
       
   178     \textit{eval (Present)} & $\dn$ & $1$\\
       
   179     \textit{eval (Absent)} & $\dn$ & $0$
       
   180   \end{tabular}                                   
       
   181   \end{center}  
       
   182 
       
   183   The function \pcode{iscore} then takes an output of \pcode{score} and sums
       
   184   up all corresponding values. For example for 
       
   185 
       
   186 \begin{lstlisting}[numbers=none]
       
   187   iscore("chess", "caves")  => 21
       
   188   iscore("chess", "swiss")  => 20
       
   189 \end{lstlisting}
       
   190   \mbox{}\hfill [0.5 Marks]
       
   191 
       
   192 \item[(5)] The function \pcode{evil} takes a list of secrets (the list from Task 1)
       
   193   and a word as arguments, and calculates the list of words with the lowest
       
   194   score (remember we want to make the Wordle game as difficult as possible---therefore
       
   195   when the user gives us a word, we want to find the secrets that produce the lowest
       
   196   score). For this implement a helper function \pcode{lowest} that goes through
       
   197   the secrets one-by-one and calculates the score. The argument \pcode{current} is
       
   198   the score of the ``currently'' found secrets. When the function \pcode{lowest}
       
   199   is called for the first time then this will be set to the maximum integer value
       
   200   \pcode{Int.MaxValue}. The accumulator will be first empty. If a secret is found
       
   201   with the same score as \pcode{current} then this word is added to the accumulator.
       
   202   If the secret has a lower score, then the accumulator will be discarded and this
       
   203   secret will be the new accumulator. If the secret has a higher score, then it can be
       
   204   ignored. For example \pcode{evil} (the wrapper for \pcode{lowest}) generates
       
   205 
       
   206 \begin{lstlisting}[numbers=none]
       
   207 evil(secrets, "stent").length => 1907
       
   208 evil(secrets, "hexes").length => 2966
       
   209 evil(secrets, "horse").length => 1203
       
   210 evil(secrets, "hoise").length => 971
       
   211 evil(secrets, "house").length => 1228
       
   212 \end{lstlisting}
       
   213 
       
   214 where \pcode{secrets} is the list generated under Task 1.
       
   215 In all cases above the iscore of the resulting secrets is 0, but this does not need to be the case
       
   216 in general.\\
       
   217   \mbox{}\hfill [1.5 Marks]
       
   218 
       
   219 \item[(6)] The secrets generated in Task 5 are the ones with the lowest score
       
   220   with respect to the word. You can think of these as the secrets that are furthest ``away'' from the
       
   221   given word. This is already quite evil for a secret word---remember we can choose
       
   222   a secret \emph{after} a user has given a first word. Now we want to make it
       
   223   even more evil by choosing words that have the most obscure letters. For this we
       
   224   calculate the frequency of how many times certain letters occur in our secrets
       
   225   list (see Task 1). The \emph{frequency} of the letter $c$, say, is given by the formula
       
   226 
       
   227   \begin{center}
       
   228   $\textit{freq(c)} \dn 1 - \frac{\textit{number of occurrences of c}}{\textit{number of all letters}}$  
       
   229   \end{center}  
       
   230 
       
   231   That means that letters that occur fewer times in our secrets have a higher frequency.
       
   232   For example the letter 'y' has the frequency 0.9680234350909651 while the much more
       
   233   often occurring letter 'e' has only 0.897286463151403 (all calculations should be done
       
   234   with Doubles).
       
   235 
       
   236   The function \pcode{frequencies} should calculate the frequencies for all lower-case letters
       
   237   by generating a Map from letters (\pcode{Char}) to Doubles (frequencies).\\ 
   135   \mbox{}\hfill [1 Mark]
   238   \mbox{}\hfill [1 Mark]
   136 
   239 
   137 \item[(2)] Implement two functions that process the (broken up)
   240 \item[(7)] In this task we want to use the output of \pcode{evil}, rank each string in the
   138   CSV-files from (1). The \pcode{process_ratings} function filters out all
   241   generated set and then filter out the strings that are ranked highest (the ones with the most obscure letters).
   139   ratings below 4 and returns a list of (userID, movieID) pairs. The
   242   This list of strings often contains only a single word, but in general there might be more (see below).
   140   \pcode{process_movies} function returns a list of (movieID, title) pairs.
   243   First implement a function \pcode{rank} that takes a frequency map (from 6) and a string as arguments and
   141   Note the input to these functions will be the output of the function
   244   generates a rank by summing up all frequencies of the letters in the string. For example
   142   \pcode{get_csv_url}.\\
   245 
   143   \mbox{}\hfill [1 Mark]
   246 \begin{lstlisting}[numbers=none]
   144 %\end{itemize}  
   247 rank(frequencies(secrets), "adobe") => 4.673604687018193
   145 %  
   248 rank(frequencies(secrets), "gaffe") => 4.745205057045945
   146 %
   249 rank(frequencies(secrets), "fuzzy") => 4.898735738513722
   147 %\subsection*{Part 3 (4 Marks, file danube.scala)}
   250 \end{lstlisting}
   148 %
   251 
   149 %\subsection*{Tasks}
   252   Finally, implement a function \pcode{ranked_evil} that selects from the output of \pcode{evil}
   150 %
   253   the string(s) which are highest ranked in evilness.
   151 %\begin{itemize}
   254 
   152 \item[(3)] Implement a kind of grouping function that calculates a Map
       
   153   containing the userIDs and all the corresponding recommendations for
       
   154   this user (list of movieIDs). This should be implemented in a
       
   155   tail-recursive fashion using a Map as accumulator. This Map is set to
       
   156   \pcode{Map()} at the beginning of the calculation. For example
       
   157 
       
   158 \begin{lstlisting}[numbers=none]
       
   159 val lst = List(("1", "a"), ("1", "b"),
       
   160                ("2", "x"), ("3", "a"),
       
   161                ("2", "y"), ("3", "c"))
       
   162 groupById(lst, Map())
       
   163 \end{lstlisting}
       
   164 
       
   165 returns the ratings map
       
   166 
       
   167 \begin{center}
       
   168   \pcode{Map(1 -> List(b, a), 2 -> List(y, x), 3 -> List(c, a))}.
       
   169 \end{center}
       
   170 
       
   171 \noindent
       
   172 In which order the elements of the list are given is unimportant.\\
       
   173 \mbox{}\hfill [1 Mark]
       
   174 
       
   175 \item[(4)] Implement a function that takes a ratings map and a movieID
       
   176   as arguments.  The function calculates all suggestions containing the
       
   177   given movie in its recommendations. It returns a list of all these
       
   178   recommendations (each of them is a list and needs to have the given
       
   179   movie deleted, otherwise it might happen we recommend the same movie
       
   180   ``back''). For example for the Map from above and the movie
       
   181   \pcode{"y"} we obtain \pcode{List(List("x"))}, and for the movie
       
   182   \pcode{"a"} we get \pcode{List(List("b"), List("c"))}.\\
       
   183   \mbox{}\hfill [1 Mark]
       
   184 
       
   185 \item[(5)] Implement a suggestions function which takes a ratings map
       
   186   and a movieID as arguments. It calculates all the recommended movies
       
   187   sorted according to the most frequently suggested movie(s) sorted
       
   188   first. This function returns \emph{all} suggested movieIDs as a list of
       
   189   strings.\\
       
   190   \mbox{}\hfill [1 Mark]
       
   191 
       
   192 \item[(6)]  
       
   193   Implement then a recommendation function which generates a maximum
       
   194   of two most-suggested movies (as calculated above). But it returns
       
   195   the actual movie name, not the movieID. If fewer movies are recommended,
       
   196   then return fewer than two movie names.\\
       
   197   \mbox{}\hfill [1 Mark]
       
   198 
       
   199 %\item[(7)] Calculate the recommendations for all movies according to
       
   200 % what the recommendations function in (6) produces (this
       
   201 % can take a few seconds). Put all recommendations into a list 
       
   202 % (of strings) and count how often the strings occur in
       
   203 % this list. This produces a list of string-int pairs,
       
   204 % where the first component is the movie name and the second
       
   205 % is the number of how many times the movie was recommended. 
       
   206 % Sort all the pairs according to the number
       
   207 % of times they were recommended (most recommended movie name 
       
   208 % first).\\
       
   209 % \mbox{}\hfill [1 Mark]
       
   210   
   255   
       
   256 \begin{lstlisting}[numbers=none]
       
   257 ranked_evil(secrets, "abbey") => List(whizz)
       
   258 ranked_evil(secrets, "afear") => List(buzzy)
       
   259 ranked_evil(secrets, "zincy") => List(jugum)
       
   260 ranked_evil(secrets, "zippy") => List(chuff)
       
   261 \end{lstlisting}
       
   262 
       
   263 This means if the user types in "abbey" then the most evil word to choose as secret is ``whizz'' (according to
       
   264 our calculations). This word has a zero \pcode{iscore} and the most obscure letters.
       
   265 
       
   266 %
       
   267 %\color{red}
       
   268 %\section*{Correction with \texttt{ranked\_evil}}
       
   269 %
       
   270 %The testcases above are actually not the maximum, but the minimum! I will make sure
       
   271 %that the task will count as solved when either the minimum (as above) or the maximum (as intended)
       
   272 %is used. The correct solutions for the above testcases using the maximum are:
       
   273 %\color{black}
       
   274 %
       
   275 %\begin{lstlisting}[numbers=none]
       
   276 %ranked_evil(secrets, "beats") => List(fuzzy)
       
   277 %ranked_evil(secrets, "vitae") => List(fuzzy)
       
   278 %ranked_evil(secrets, "bento") => List(fuzzy)
       
   279 %ranked_evil(secrets, "belts") => List(fuzzy)
       
   280 %\end{lstlisting}
       
   281 %
       
   282 %\noindent \textcolor{red}{Some further testcases for the maximum are:}
       
   283 %
       
   284 %\begin{lstlisting}[numbers=none]
       
   285 %ranked_evil(secrets, "abbey") => List(whizz)
       
   286 %ranked_evil(secrets, "afear") => List(buzzy)
       
   287 %ranked_evil(secrets, "zincy") => List(jugum)
       
   288 %ranked_evil(secrets, "zippy") => List(chuff)
       
   289 %\end{lstlisting}
       
   290 % 
       
   291 %
       
   292 
       
   293 \mbox{}\hfill [1 Mark]  
   211 \end{itemize}
   294 \end{itemize}
   212 
   295 
   213 \end{document} 
   296 \end{document} 
   214 
   297 
   215 %%% Local Variables: 
   298 %%% Local Variables: