--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/solutions-resit/cw-resit.tex Sun Aug 23 14:39:58 2020 +0100
@@ -0,0 +1,208 @@
+\documentclass{article}
+\usepackage{../style}
+\usepackage{../langs}
+
+\begin{document}
+
+\section*{August Exam (Scala): Chat Log Mining}
+
+This coursework is worth 50\%. It is about mining a log of an online
+chat between 85 participants. The log is given as a csv-list in the file
+\texttt{log.csv}. The log is an unordered list recording which
+message was sent, by whom, when and in response to which other
+message. Each message also has a number and a unique hash code.\bigskip
+
+\noindent
+\textbf{Important:} Make sure the file you submit can be processed
+by just calling
+
+\begin{center}
+ \texttt{scala <<filename.scala>>}
+\end{center}
+
+\noindent
+Do not use any mutable data structures in your
+submission! They are not needed. This means, for example, that you
+cannot use \texttt{ListBuffer}s or \texttt{Array}s. Do not use
+\texttt{return} in your code! It has a different meaning in Scala
+than in Java. Do not use \texttt{var}! This declares a mutable
+variable.
+
+\subsection*{Disclaimer}
+
+It should be understood that the work you submit represents your own
+effort! You have not copied from anyone or anywhere else. An exception
+is the Scala code I showed during the lectures or uploaded to KEATS,
+which you can freely use.\bigskip
+
+
+\subsection*{Background}
+
+\noindent
+The fields in the file \texttt{log.csv} are organised
+as follows:
+
+\begin{center}
+\texttt{counter, id, time\_date, name, country, parent\_id, msg}
+\end{center}
+
+\noindent
+Each line in this file contains the data for a single message. The field
+\texttt{counter} is an integer number given to each message; \texttt{id} is a
+unique hash string for a message; \texttt{time\_date} is the time when the message
+was sent; \texttt{name} and \texttt{country} are data about the author
+of the message, although authors sometimes left the country information
+empty; \texttt{parent\_id} is a hash specifying which other message the
+message answers (this can also be empty); \texttt{msg} is the actual
+message text. \textbf{Be careful} in the tasks below: this text can contain
+commas and needs special treatment when the line is split up
+by using \texttt{line.split(",").toList}. Tasks (2) and (3) are about
+processing this data and storing it in the \texttt{Rec}-data-structure, which
+is pre-defined in the file \texttt{resit.scala}:
+
+
+\begin{center}
+\begin{verbatim}
+ Rec(num: Int,
+ msg_id: String,
+ date: String,
+ msg: String,
+ author: String,
+ country: Option[String],
+ reply_id : Option[String],
+ parent: Option[Int] = None,
+ children: List[Int] = Nil)
+\end{verbatim}
+\end{center}
+
+\noindent
+The transformation into a Rec-data-structure is a two-step process
+where first the fields for parents and children are given default
+values. This information is then filled in in a second step.
+
+The main information that will be computed in the tasks below is
+which countries the authors are from and how many authors are from each
+country. The last task will also rank which messages have been the most
+popular in terms of how many replies they received (this will be computed
+as the number of children, grandchildren and so on of a
+message).
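+
+\noindent
+To illustrate the warning about commas above, here is a minimal sketch of
+how \texttt{split} behaves on a line whose message contains commas. The
+line is made up for illustration and is not taken from \texttt{log.csv};
+the names \texttt{line}, \texttt{fields} and \texttt{msg} are
+hypothetical, and the re-joining step is only one possible approach.
+
+\begin{verbatim}
+// hypothetical example line (7 fields, empty parent_id)
+val line = "7,abc123,2020-05-15,Alice,UK,,Hello, world, again"
+
+val fields = line.split(",").toList
+// fields has 9 elements, not 7: the message was split at its commas
+
+// one possible way to rebuild the message text
+val msg = fields.drop(6).mkString(",")
+// msg is "Hello, world, again"
+\end{verbatim}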
+
+\subsection*{Tasks}
+
+
+\begin{itemize}
+\item[(1)] The function \texttt{get\_csv} takes a file name as
+ argument. It should read the corresponding file and return its
+ content. The content should be returned as a list of strings, namely a
+ string for each line in the file. Since the file is a csv-file, the
+ first line (the header) should be dropped in the result. Lines are
+ separated by \verb!"\n"!. For the file \texttt{log.csv} there should
+ be a list of 680 separate strings.
+
+ \mbox{}\hfill[5\% Marks]
+
+
+\item[(2)] The function \texttt{process\_line} takes a single line
+ from the csv-file (as generated by \texttt{get\_csv}) and creates a
+ Rec(ord) data structure. This data structure is pre-defined in the
+ Scala file.
+
+ For processing a line, you should use the function
+
+ \begin{center}
+ \verb!<<some_line>>.split(",").toList!
+ \end{center}
+
+ \noindent
+  in order to separate the fields. HOWEVER BE CAREFUL that the message
+  text in the last field of \texttt{log.csv} can contain commas and
+  therefore the split will not always result in a list of only 7
+  elements. You need to concatenate the 7th element and everything
+  after it into a single string before assigning the field \texttt{msg}.
+
+ \mbox{}\hfill[10\% Marks]
+
+\item[(3)] Each record in the log contains a unique hash code
+  identifying the message. For example
+
+ \begin{center}
+ \verb!"5ebeb459ac278d01301f1497"!
+ \end{center}
+
+ \noindent
+  Some messages also contain a hash code identifying the parent
+  message (that is, the question to which they reply). The function
+ \texttt{post\_process} fills in the information about potential
+ children and a potential parent message.
+
+ The auxiliary function \texttt{get\_children} takes a record
+ \texttt{e} and a record list \texttt{rs} as arguments, and returns
+ the list of all direct children (children have the hash code of
+ \texttt{e} as \texttt{reply\_id}). The list of children is returned
+ as a list of \texttt{num}s. The \texttt{num}s can be used later
+ as indexes in a Rec-list.
+
+ The auxiliary function \texttt{get\_parent} returns the number of
+ the record corresponding to the \texttt{reply\_id} (encoded as
+ \texttt{Some} if there exists one, otherwise it returns \texttt{None}).
+
+ In order to update a record, say \texttt{r}, with some additional
+ information, you can use the Scala code
+ \begin{verbatim}
+ r.copy(parent = ....,
+ children = ....)
+ \end{verbatim}
+
+ \mbox{}\hfill[10\% Marks]
+
+\item[(4)] The functions \texttt{get\_countries} and
+  \texttt{get\_countries\_numbers} calculate which countries
+  message authors come from and how many authors come from each
+  country (returned as a \texttt{Map} from countries to Integers). In
+  case an author did not specify a country, the empty string should
+  be returned.
+
+  \mbox{}\hfill[10\% Marks]
+
+
+\item[(5)] This task identifies the most popular questions in the log,
+  whereby popularity is measured in terms of how many follow-up
+  questions were asked. We say that such questions belong to a
+  \emph{thread}. It can be assumed that in \texttt{log.csv} there are
+  no circular references, that is, no question refers to a
+  follow-up question as parent.
+
+ The function \texttt{ordered\_thread\_sizes} orders the
+  message threads according to how many answers were given for one
+  message (that is, how many children, grandchildren and so on one
+  message has).
+
+ The auxiliary function \texttt{search} enumerates all children,
+ grand-children and so on for a given record \texttt{r} (including
+  the record \texttt{r} itself). The function \texttt{search} returns
+  these children as a list of \texttt{Rec}s.
+
+  The function \texttt{thread\_size} generates for a record, say
+  \texttt{r}, a pair consisting of the number of \texttt{r} and the
+  number of all children as produced by \texttt{search}. The numbers are the
+  integers given for each message; for \texttt{log.csv} a number
+  is between 0 and 679.
+
+ The function \texttt{ordered\_thread\_sizes} orders the list of
+ pairs according to which thread in the chat is the longest (the
+ longest should be first).
+
+\mbox{}\hfill[15\% Marks]
+\end{itemize}
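+
+\noindent
+As an illustration of the \texttt{copy} hint in Task (3), the following
+minimal sketch shows that \texttt{copy} returns a new record and leaves
+the original untouched. It assumes that \texttt{Rec} is a case class with
+the fields listed in the Background section (the actual definition is in
+\texttt{resit.scala}); all field values are made up.
+
+\begin{verbatim}
+// assumed definition, mirroring the fields from the Background section
+case class Rec(num: Int,
+               msg_id: String,
+               date: String,
+               msg: String,
+               author: String,
+               country: Option[String],
+               reply_id: Option[String],
+               parent: Option[Int] = None,
+               children: List[Int] = Nil)
+
+// hypothetical record with default parent and children
+val r = Rec(3, "ccc", "2020-05-15", "Hi", "Alice",
+            Some("UK"), Some("aaa"))
+
+// copy builds a new record; only the named fields change
+val r2 = r.copy(parent = Some(0),
+                children = List(5, 8))
+
+// r.children is still Nil, r2.children is List(5, 8)
+\end{verbatim}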
+
+\end{document}
+
+%%% Local Variables:
+%%% mode: latex
+%%% TeX-master: t
+%%% End: