\documentclass{article}
\usepackage{../style}
\usepackage{../langs}
\begin{document}
\section*{August Exam (Scala): Chat Log Mining}
This coursework is worth 50\%. It is about mining a log of an online
chat between 85 participants. The log is given as a csv-list in the file
\texttt{log.csv}. The log is an unordered list recording which
message was sent, by whom, when, and in response to which other
message. Each message also has a number and a unique hash code.\bigskip
\noindent
\textbf{Important:} Make sure the file you submit can be processed
by just calling
\begin{center}
\texttt{scala <<filename.scala>>}
\end{center}
\noindent
Do not use any mutable data structures in your
submission! They are not needed. This means you cannot use, for example,
\texttt{ListBuffer}s or \texttt{Array}s. Do not use
\texttt{return} in your code! It has a different meaning in Scala
than in Java. Do not use \texttt{var}! This declares a mutable
variable.
\subsection*{Disclaimer}
It should be understood that the work you submit represents your own
effort! You have not copied from anyone or anywhere else. An exception
is the Scala code I showed during the lectures or uploaded to KEATS,
which you can freely use.\bigskip
\subsection*{Background}
\noindent
The fields in the file \texttt{log.csv} are organised
as follows:
\begin{center}
\texttt{counter, id, time\_date, name, country, parent\_id, msg}
\end{center}
\noindent
Each line in this file contains the data for a single message. The field
\texttt{counter} is an integer number given to each message; \texttt{id} is a
unique hash string for a message; \texttt{time\_date} is the time when the message
was sent; \texttt{name} and \texttt{country} are data about the author
of the message, whereby sometimes the authors left the country information
empty; \texttt{parent\_id} is a hash specifying which other message the
message answers (this can also be empty); \texttt{msg} is the actual
message text. \textbf{Be careful} in the tasks below: this text can contain
commas and needs special treatment when the line is split up
by using \texttt{line.split(",").toList}. Tasks (2) and (3) are about
processing this data and storing it in the \texttt{Rec}-data-structure, which
is pre-defined in the file \texttt{resit.scala}:
\begin{center}
\begin{verbatim}
Rec(num: Int,
msg_id: String,
date: String,
msg: String,
author: String,
country: Option[String],
reply_id : Option[String],
parent: Option[Int] = None,
children: List[Int] = Nil)
\end{verbatim}
\end{center}
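
\noindent
The comma problem mentioned above can be illustrated with a minimal
sketch. The helper names \texttt{splitFields} and \texttt{optional} and the
example line are made up for illustration and are not part of
\texttt{resit.scala}; re-joining the surplus pieces with commas is an
assumption about how the message text should be restored.
\begin{verbatim}
// Hypothetical helper: split a csv line into the six leading
// fields plus one message field. Surplus pieces produced by
// commas inside the message are re-joined (assumption: with ",").
// Note: split(",") drops trailing empty strings.
def splitFields(line: String): List[String] = {
  val parts = line.split(",").toList
  val (fixed, rest) = parts.splitAt(6)
  fixed ::: List(rest.mkString(","))
}

// Hypothetical helper: an empty field becomes None.
def optional(s: String): Option[String] =
  if (s.isEmpty) None else Some(s)

// Made-up line in the format of log.csv (empty parent_id).
val example = "3,abc123,2020-05-15,Alice,UK,,Hello, world, again"
val fields = splitFields(example)
\end{verbatim}
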
\noindent
The transformation into a Rec-data-structure is a two-step process
where first the fields for parents and children are given default
values. This information is then filled in in a second step.
The main information that will be computed in the tasks below is
which countries the authors are from and how many authors are from each
country. The last task will also rank which messages have been the most
popular in terms of how many replies they received (this will be computed
according to the number of children, grand-children and so on of a
message).
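
\noindent
The two-step idea can be sketched on a toy example. The case class
\texttt{Node} below is a simplified, hypothetical stand-in for \texttt{Rec}
with only the fields needed here, and the function name \texttt{link} is
likewise an assumption; this is one possible way to fill in the
information, not the required solution.
\begin{verbatim}
// Simplified, hypothetical stand-in for Rec.
case class Node(num: Int, id: String, replyId: Option[String],
                parent: Option[Int] = None,
                children: List[Int] = Nil)

// Step 1: parent and children still have their default values.
val step1 = List(
  Node(0, "h0", None),
  Node(1, "h1", Some("h0")),
  Node(2, "h2", Some("h0")),
  Node(3, "h3", Some("h1")))

// Step 2: fill in both fields by searching the whole list.
def link(ns: List[Node]): List[Node] =
  ns.map { n =>
    val par = ns.find(p => n.replyId.contains(p.id)).map(_.num)
    val chs = ns.filter(_.replyId.contains(n.id)).map(_.num)
    n.copy(parent = par, children = chs)
  }

val step2 = link(step1)
\end{verbatim}

\noindent
In this toy list, message 0 ends up with the children 1 and 2, and
message 3 has message 1 as its parent.
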
\subsection*{Tasks}
\begin{itemize}
\item[(1)] The function \texttt{get\_csv} takes a file name as
argument. It should read the corresponding file and return its
content. The content should be returned as a list of strings, namely a
string for each line in the file. Since the file is a csv-file, the
first line (the header) should be dropped in the result. Lines are
separated by \verb!"\n"!. For the file \texttt{log.csv} there should
be a list of 680 separate strings.
\mbox{}\hfill[5\% Marks]
\item[(2)] The function \texttt{process\_line} takes a single line
from the csv-file (as generated by \texttt{get\_csv}) and creates a
Rec(ord) data structure. This data structure is pre-defined in the
Scala file.
For processing a line, you should use the function
\begin{center}
\verb!<<some_line>>.split(",").toList!
\end{center}
\noindent
in order to separate the fields. HOWEVER, BE CAREFUL: the message
text in the last field of \texttt{log.csv} can contain commas and
therefore the split will not always result in a list of only 7
elements. You need to concatenate the 7th element and anything beyond
it into a single string before assigning the field \texttt{msg}.
\mbox{}\hfill[10\% Marks]
\item[(3)] Each record in the log contains a unique hash code
identifying the message, for example
\begin{center}
\verb!"5ebeb459ac278d01301f1497"!
\end{center}
\noindent
Some messages also contain a hash code identifying the parent
message (that is, the message to which they reply). The function
\texttt{post\_process} fills in the information about potential
children and a potential parent message.
The auxiliary function \texttt{get\_children} takes a record
\texttt{e} and a record list \texttt{rs} as arguments, and returns
the list of all direct children (children have the hash code of
\texttt{e} as \texttt{reply\_id}). The list of children is returned
as a list of \texttt{num}s. The \texttt{num}s can be used later
as indexes in a Rec-list.
The auxiliary function \texttt{get\_parent} returns the number of
the record corresponding to the \texttt{reply\_id} (encoded as
\texttt{Some} if there exists one, otherwise it returns \texttt{None}).
In order to update a record, say \texttt{r}, with some additional
information, you can use the Scala code
\begin{verbatim}
r.copy(parent = ....,
children = ....)
\end{verbatim}
\mbox{}\hfill[10\% Marks]
\item[(4)] The functions \texttt{get\_countries} and
\texttt{get\_countries\_numbers} calculate the countries where
message authors are coming from and how many authors come from each
country (returned as a \texttt{Map} from countries to Integers). In
case an author did not specify a country, the empty string should
be returned.
\mbox{}\hfill[10\% Marks]
\item[(5)] This task identifies the most popular questions in the log,
whereby popularity is measured in terms of how many follow-up
questions were asked. We say that such questions belong to a
\emph{thread}. It can be assumed that in \texttt{log.csv} there are
no circular references, that is, no question refers to a
follow-up question as parent.
The function \texttt{ordered\_thread\_sizes} orders the
message threads according to how many answers were given for one
message (that is how many children, grand-children and so on one
message has).
The auxiliary function \texttt{search} enumerates all children,
grand-children and so on for a given record \texttt{r} (including
the record \texttt{r} itself). The function \texttt{search} returns these children
as a list of \texttt{Rec}s.
The function \texttt{thread\_size} generates for a record, say
\texttt{r}, a pair consisting of the number of \texttt{r} and the
number of all children as produced by \texttt{search}. The numbers are the
integers given for each message; for \texttt{log.csv} a number
is between 0 and 679.
The function \texttt{ordered\_thread\_sizes} orders the list of
pairs according to which thread in the chat is the longest (the
longest should be first).
\mbox{}\hfill[15\% Marks]
\end{itemize}
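
\noindent
The recursive enumeration described in Task (5) can be sketched on a
miniature example. The case class \texttt{Msg} and the function name
\texttt{descendants} are hypothetical and much simpler than \texttt{Rec};
the sketch assumes that the numbers of the messages coincide with their
positions in the list and that there are no circular references. The
sizes count the record itself, following the description of
\texttt{search}.
\begin{verbatim}
// Hypothetical miniature record: num is also the list index.
case class Msg(num: Int, children: List[Int])

val msgs = List(
  Msg(0, List(1, 2)),
  Msg(1, List(3)),
  Msg(2, Nil),
  Msg(3, Nil),
  Msg(4, Nil))

// A record followed by all its children, grand-children, ...
def descendants(m: Msg, ms: List[Msg]): List[Msg] =
  m :: m.children.flatMap(c => descendants(ms(c), ms))

// Pairs of record number and thread size, largest first.
val sizes = msgs.map(m => (m.num, descendants(m, msgs).length))
val ordered = sizes.sortBy(p => -p._2)
\end{verbatim}

\noindent
Here message 0 heads the largest thread, since messages 1, 2 and 3 all
descend from it.
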
\end{document}
%%% Local Variables:
%%% mode: latex
%%% TeX-master: t
%%% End: