diff -r 19b75e899d37 -r 9c03b5e89a2a solutions-resit/cw-resit.tex
--- a/solutions-resit/cw-resit.tex Fri Apr 26 17:29:30 2024 +0100
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000
@@ -1,208 +0,0 @@
-\documentclass{article}
-\usepackage{../style}
-\usepackage{../langs}
-
-\begin{document}
-
-\section*{August Exam (Scala): Chat Log Mining}
-
-This coursework is worth 50\%. It is about mining a log of an online
-chat between 85 participants. The log is given as a csv-list in the file
-\texttt{log.csv}. The log is an unordered list recording which
-message was sent, by whom, when, and in response to which other
-message. Each message also has a number and a unique hash code.\bigskip
-
-\noindent
-\textbf{Important:} Make sure the file you submit can be processed
-by just calling
-
-\begin{center}
-  \texttt{scala <>}
-\end{center}
-
-\noindent
-Do not use any mutable data structures in your
-submission! They are not needed. This means you cannot use,
-for example, \texttt{ListBuffer}s or \texttt{Array}s. Do not use
-\texttt{return} in your code! It has a different meaning in Scala
-than in Java. Do not use \texttt{var}! This declares a mutable
-variable.
-
-\subsection*{Disclaimer}
-
-It should be understood that the work you submit represents your own
-effort! You have not copied from anyone or anywhere else. An exception
-is the Scala code I showed during the lectures or uploaded to KEATS,
-which you can freely use.\bigskip
-
-
-\subsection*{Background}
-
-\noindent
-The fields in the file \texttt{log.csv} are organised
-as follows:
-
-\begin{center}
-\texttt{counter, id, time\_date, name, country, parent\_id, msg}
-\end{center}
-
-\noindent
-Each line in this file contains the data for a single message. The field
-\texttt{counter} is an integer number given to each message; \texttt{id} is a
-unique hash string for a message; \texttt{time\_date} is the time when the message
-was sent; \texttt{name} and \texttt{country} are data about the author
-of the message, whereby sometimes the authors left the country information
-empty; \texttt{parent\_id} is a hash specifying which other message the
-message answers (this can also be empty). The field \texttt{msg} is the actual
-message text. \textbf{Be careful} in the tasks below: this text can contain
-commas and needs special treatment when the line is split up
-by using \texttt{line.split(",").toList}. Tasks (2) and (3) are about
-processing this data and storing it into the \texttt{Rec}-data-structure, which
-is pre-defined in the file \texttt{resit.scala}:
-
-
-\begin{center}
-\begin{verbatim}
-  Rec(num: Int,
-      msg_id: String,
-      date: String,
-      msg: String,
-      author: String,
-      country: Option[String],
-      reply_id : Option[String],
-      parent: Option[Int] = None,
-      children: List[Int] = Nil)
-\end{verbatim}
-\end{center}
-
-\noindent
-The transformation into a Rec-data-structure is a two-step process
-where first the fields for parents and children are given default
-values. This information is then filled in in a second step.
-
-The main information that will be computed in the tasks below is
-which countries the authors are from and how many authors are from each
-country. The last task will also rank which messages have been the most
-popular in terms of how many replies they received (this will be computed
-according to the number of children, grand-children and so on of a
-message).
-
-\subsection*{Tasks}
-
-
-\begin{itemize}
-\item[(1)] The function \texttt{get\_csv} takes a file name as
-  argument. It should read the corresponding file and return its
-  content. The content should be returned as a list of strings, namely a
-  string for each line in the file. Since the file is a csv-file, the
-  first line (the header) should be dropped in the result. Lines are
-  separated by \verb!"\n"!. For the file \texttt{log.csv} there should
-  be a list of 680 separate strings.
-
-  \mbox{}\hfill[5\% Marks]
-
-
-\item[(2)] The function \texttt{process\_line} takes a single line
-  from the csv-file (as generated by \texttt{get\_csv}) and creates a
-  Rec(ord) data structure. This data structure is pre-defined in the
-  Scala file.
-
-  For processing a line, you should use the function
-
-  \begin{center}
-    \verb!<>.split(",").toList!
-  \end{center}
-
-  \noindent
-  in order to separate the fields. HOWEVER BE CAREFUL that the message
-  text in the last field of \texttt{log.csv} can contain commas and
-  therefore the split will not always result in a list of only 7
-  elements. You need to concatenate everything from the 7th element
-  onwards into a single string before assigning the field \texttt{msg}.
-
-  \mbox{}\hfill[10\% Marks]
-
-\item[(3)] Each record in the log contains a unique hash code
-  identifying each message. For example
-
-  \begin{center}
-    \verb!"5ebeb459ac278d01301f1497"!
-  \end{center}
-
-  \noindent
-  Some messages also contain a hash code identifying the parent
-  message (that is, the question to which they reply). The function
-  \texttt{post\_process} fills in the information about potential
-  children and a potential parent message.
-
-  The auxiliary function \texttt{get\_children} takes a record
-  \texttt{e} and a record list \texttt{rs} as arguments, and returns
-  the list of all direct children (children have the hash code of
-  \texttt{e} as \texttt{reply\_id}). The list of children is returned
-  as a list of \texttt{num}s. The \texttt{num}s can be used later
-  as indexes in a Rec-list.
-
-  The auxiliary function \texttt{get\_parent} returns the number of
-  the record corresponding to the \texttt{reply\_id} (encoded as
-  \texttt{Some} if there exists one, otherwise it returns \texttt{None}).
-
-  In order to update a record, say \texttt{r}, with some additional
-  information, you can use the Scala code
-  \begin{verbatim}
-    r.copy(parent = ....,
-           children = ....)
-  \end{verbatim}
-
-  \mbox{}\hfill[10\% Marks]
-
-\item[(4)] The functions \texttt{get\_countries} and
-  \texttt{get\_countries\_numbers} calculate the countries where
-  message authors are coming from and how many authors come from each
-  country (returned as a \texttt{Map} from countries to Integers). In
-  case an author did not specify a country, the empty string should
-  be returned.
-
-  \mbox{}\hfill[10\% Marks]
-
-
-\item[(5)] This task identifies the most popular questions in the log,
-  whereby popularity is measured in terms of how many follow-up
-  questions were asked. We say that a question and its follow-ups form a
-  \emph{thread}. It can be assumed that in \texttt{log.csv} there are
-  no circular references, that is, no question refers to a
-  follow-up question as parent.
-
-  The function \texttt{ordered\_thread\_sizes} orders the
-  message threads according to how many answers were given for one
-  message (that is, how many children, grand-children and so on one
-  message has).
-
-  The auxiliary function \texttt{search} enumerates all children,
-  grand-children and so on for a given record \texttt{r} (including
-  the record \texttt{r} itself). The function \texttt{search} returns
-  these records as a list of \texttt{Rec}s.
-
-  The function \texttt{thread\_size} generates for a record, say
-  \texttt{r}, a pair consisting of the number of \texttt{r} and the
-  number of all children as produced by \texttt{search}. The numbers are the
-  integers given for each message; for \texttt{log.csv} a number
-  is between 0 and 679.
-
-  The function \texttt{ordered\_thread\_sizes} orders the list of
-  pairs according to which thread in the chat is the longest (the
-  longest should be first).
-
-\mbox{}\hfill[15\% Marks]
-\end{itemize}
-
-\end{document}
-
-
-
-\end{document}
-
-
-%%% Local Variables:
-%%% mode: latex
-%%% TeX-master: t
-%%% End:
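
Note on task (2) of the removed sheet: the comma caveat for the message field can be illustrated with a minimal sketch. This is not the reference solution. The helper name splitFields and the example line are made up for illustration, and the sketch assumes that the message text is the 7th field and that the split-off pieces should be re-joined with commas.

```scala
// Hypothetical helper (not part of resit.scala): splits one csv line into
// at most seven fields, keeping commas that occur inside the message text.
def splitFields(line: String): List[String] = {
  val parts = line.split(",").toList
  // The first six elements are the fixed fields; the rest is message text.
  val (fixed, rest) = parts.splitAt(6)
  // Re-joining with "," restores the commas that split removed.
  fixed :+ rest.mkString(",")
}

// Made-up line in the field order counter, id, time_date, name, country,
// parent_id, msg; the parent_id field is empty here.
val example = "3,abc123,2020-05-15T10:00:00,Alice,UK,,Hello, world, again"

val fields = splitFields(example)
println(fields.length) // prints 7
println(fields.last)   // prints Hello, world, again
```

One thing to keep in mind with this approach: split(",") drops trailing empty strings, so a line that ends in empty fields yields fewer elements than expected and would need separate handling.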