diff -r 7e00d2b13b04 -r 25d9c3b2bc99 solutions-resit/cw-resit.tex
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/solutions-resit/cw-resit.tex Sun Aug 23 14:39:58 2020 +0100
@@ -0,0 +1,208 @@
+\documentclass{article}
+\usepackage{../style}
+\usepackage{../langs}
+
+\begin{document}
+
+\section*{August Exam (Scala): Chat Log Mining}
+
+This coursework is worth 50\%. It is about mining a log of an online
+chat between 85 participants. The log is given as a csv-list in the file
+\texttt{log.csv}. The log is an unordered list recording which
+message was sent, by whom, when, and in response to which other
+message. Each message also has a number and a unique hash code.\bigskip
+
+\noindent
+\textbf{Important:} Make sure the file you submit can be processed
+by just calling
+
+\begin{center}
+  \texttt{scala <>}
+\end{center}
+
+\noindent
+Do not use any mutable data structures in your
+submission! They are not needed. This means you cannot use,
+for example, \texttt{ListBuffer}s or \texttt{Array}s. Do not use
+\texttt{return} in your code! It has a different meaning in Scala
+than in Java. Do not use \texttt{var}! This declares a mutable
+variable.
+
+\subsection*{Disclaimer}
+
+It should be understood that the work you submit represents your own
+effort! You have not copied from anyone or anywhere else. An exception
+is the Scala code I showed during the lectures or uploaded to KEATS,
+which you can freely use.\bigskip
+
+
+\subsection*{Background}
+
+\noindent
+The fields in the file \texttt{log.csv} are organised
+as follows:
+
+\begin{center}
+\texttt{counter, id, time\_date, name, country, parent\_id, msg}
+\end{center}
+
+\noindent
+Each line in this file contains the data for a single message.
+The field \texttt{counter} is an integer number given to each message; \texttt{id} is a
+unique hash string for a message; \texttt{time\_date} is the time when the message
+was sent; \texttt{name} and \texttt{country} are data about the author
+of the message, though authors sometimes left the country information
+empty; \texttt{parent\_id} is a hash specifying which other message the
+message answers (this can also be empty). The field \texttt{msg} is the actual
+message text. \textbf{Be careful} in the tasks below: this text can contain
+commas and needs special treatment when the line is split up
+by using \texttt{line.split(",").toList}. Tasks (2) and (3) are about
+processing this data and storing it in the \texttt{Rec}-data-structure, which
+is pre-defined in the file \texttt{resit.scala}:
+
+
+\begin{center}
+\begin{verbatim}
+  Rec(num: Int,
+      msg_id: String,
+      date: String,
+      msg: String,
+      author: String,
+      country: Option[String],
+      reply_id: Option[String],
+      parent: Option[Int] = None,
+      children: List[Int] = Nil)
+\end{verbatim}
+\end{center}
+
+\noindent
+The transformation into a Rec-data-structure is a two-step process:
+first the fields for parents and children are given default
+values; this information is then filled in during a second step.
+
+The main information computed in the tasks below is which
+countries the authors are from and how many authors are from each
+country. The last task also ranks which messages have been the most
+popular in terms of how many replies they received (this will be computed
+as the number of children, grand-children and so on of a
+message).
+
+\subsection*{Tasks}
+
+
+\begin{itemize}
+\item[(1)] The function \texttt{get\_csv} takes a file name as
+  argument. It should read the corresponding file and return its
+  content. The content should be returned as a list of strings, namely a
+  string for each line in the file.
+  Since the file is a csv-file, the first line (the header) should be
+  dropped in the result. Lines are separated by \verb!"\n"!. For the file
+  \texttt{log.csv} there should be a list of 680 separate strings.
+
+  \mbox{}\hfill[5\% Marks]
+
+
+\item[(2)] The function \texttt{process\_line} takes a single line
+  from the csv-file (as generated by \texttt{get\_csv}) and creates a
+  Rec(ord) data structure. This data structure is pre-defined in the
+  Scala file.
+
+  For processing a line, you should use the function
+
+  \begin{center}
+    \verb!<>.split(",").toList!
+  \end{center}
+
+  \noindent
+  in order to separate the fields. However, \textbf{be careful}: the message
+  text in the last field of \texttt{log.csv} can contain commas and
+  therefore the split will not always result in a list of exactly 7
+  elements. You need to join the 7th element and anything beyond it into
+  a single string before assigning the field \texttt{msg}.
+
+  \mbox{}\hfill[10\% Marks]
+
+\item[(3)] Each record in the log contains a unique hash code
+  identifying the message, for example
+
+  \begin{center}
+    \verb!"5ebeb459ac278d01301f1497"!
+  \end{center}
+
+  \noindent
+  Some messages also contain a hash code identifying the parent
+  message (that is, the question to which they reply). The function
+  \texttt{post\_process} fills in the information about potential
+  children and a potential parent message.
+
+  The auxiliary function \texttt{get\_children} takes a record
+  \texttt{e} and a record list \texttt{rs} as arguments, and returns
+  the list of all direct children (children have the hash code of
+  \texttt{e} as \texttt{reply\_id}). The list of children is returned
+  as a list of \texttt{num}s. The \texttt{num}s can be used later
+  as indexes in a Rec-list.
+
+  The auxiliary function \texttt{get\_parent} returns the number of
+  the record corresponding to the \texttt{reply\_id} (wrapped in
+  \texttt{Some} if one exists; otherwise it returns \texttt{None}).
+
+  In order to update a record, say \texttt{r}, with some additional
+  information, you can use the Scala code
+  \begin{verbatim}
+  r.copy(parent = ....,
+         children = ....)
+  \end{verbatim}
+
+  \mbox{}\hfill[10\% Marks]
+
+\item[(4)] The functions \texttt{get\_countries} and
+  \texttt{get\_countries\_numbers} determine the countries that
+  message authors come from and how many authors come from each
+  country (returned as a \texttt{Map} from countries to integers). If
+  an author did not specify a country, the empty string should
+  be returned.
+
+  \mbox{}\hfill[10\% Marks]
+
+
+\item[(5)] This task identifies the most popular questions in the log,
+  whereby popularity is measured in terms of how many follow-up
+  questions were asked. We say that such questions belong to a
+  \emph{thread}. It can be assumed that in \texttt{log.csv} there are
+  no circular references, that is, no question refers to a
+  follow-up question as its parent.
+
+  The function \texttt{ordered\_thread\_sizes} orders the
+  message threads according to how many answers were given for one
+  message (that is, how many children, grand-children and so on one
+  message has).
+
+  The auxiliary function \texttt{search} enumerates all children,
+  grand-children and so on for a given record \texttt{r} (including
+  the record \texttt{r} itself). It returns these records
+  as a list of \texttt{Rec}s.
+
+  The function \texttt{thread\_size} generates for a record, say
+  \texttt{r}, a pair consisting of the number of \texttt{r} and the
+  number of all children as produced by \texttt{search}. The numbers are the
+  integers given to each message; for \texttt{log.csv} a number
+  is between 0 and 679.
+
+  The function \texttt{ordered\_thread\_sizes} orders the list of
+  pairs according to which thread in the chat is the longest (the
+  longest should be first).
+
+\mbox{}\hfill[15\% Marks]
+\end{itemize}
+\end{document}
+
+
+
+\end{document}
+
+
+%%% Local Variables:
+%%% mode: latex
+%%% TeX-master: t
+%%% End: