solutions-resit/cw-resit.tex
author Christian Urban <christian.urban@kcl.ac.uk>
Mon, 08 Nov 2021 02:37:28 +0000

\documentclass{article}
\usepackage{../style}
\usepackage{../langs}

\begin{document}

\section*{August Exam (Scala):  Chat Log Mining}

This coursework is worth 50\%. It is about mining a log of an online
chat between 85 participants. The log is given as a csv-list in the file
\texttt{log.csv}. The log is an unordered list recording which
message was sent, by whom, when, and in response to which other
message. Each message also has a number and a unique hash code.\bigskip

\noindent 
\textbf{Important:} Make sure the file you submit can be processed 
by just calling

\begin{center}
  \texttt{scala <<filename.scala>>}
\end{center}

\noindent
Do not use any mutable data structures in your
submission! They are not needed. This means you cannot use, for
example, \texttt{ListBuffer}s or \texttt{Array}s. Do not use
\texttt{return} in your code! It has a different meaning in Scala
than in Java.  Do not use \texttt{var}! This declares a mutable
variable.  

\subsection*{Disclaimer}

It should be understood that the work you submit represents your own
effort! You have not copied from anyone or anywhere else. An exception
is the Scala code I showed during the lectures or uploaded to KEATS,
which you can freely use.\bigskip


\subsection*{Background}

\noindent
The fields in the file \texttt{log.csv} are organised 
as follows:

\begin{center}
\texttt{counter, id, time\_date, name, country, parent\_id, msg}
\end{center}  

\noindent
Each line in this file contains the data for a single message.  The field
\texttt{counter} is an integer number given to each message; \texttt{id} is a
unique hash string for a message; \texttt{time\_date} is the time when the message
was sent; \texttt{name} and \texttt{country} are data about the author
of the message, whereby authors sometimes left the country information
empty; \texttt{parent\_id} is a hash specifying which other message the
message answers (this can also be empty); \texttt{msg} is the actual
message text. \textbf{Be careful} in the tasks below: this text can contain
commas and needs special treatment when the line is split up
by using \texttt{line.split(",").toList}. Tasks (2) and (3) are about
processing this data and storing it into the \texttt{Rec}-data-structure, which
is pre-defined in the file \texttt{resit.scala}:


\begin{center}
\begin{verbatim}  
  Rec(num: Int, 
      msg_id: String,
      date: String,
      msg: String,
      author: String,
      country: Option[String],
      reply_id : Option[String],
      parent: Option[Int] = None,
      children: List[Int] = Nil)  
\end{verbatim}
\end{center}

\noindent
The transformation into a Rec-data-structure is a two-step process
where first the fields for parents and children are given default
values. This information is then filled in in a second step.
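
\noindent
As an illustration of the comma problem, here is a minimal sketch of
how a single line could be taken apart. The sample line is made up (it
is not taken from \texttt{log.csv}), the helper \texttt{opt} is
hypothetical, and it is an assumption that re-joining the pieces with a
comma and mapping empty fields to \texttt{None} is what is intended.

\begin{verbatim}
  // Rec as described above
  case class Rec(num: Int, msg_id: String, date: String,
                 msg: String, author: String,
                 country: Option[String],
                 reply_id: Option[String],
                 parent: Option[Int] = None,
                 children: List[Int] = Nil)

  // made-up sample line: empty country, empty parent_id,
  // and a comma inside the message text
  val line =
    "3,5ebeb459ac278d01301f1497,2020-05-15,Alice,,,Hi all, is this on?"

  // the split also cuts the message text into pieces
  val fields = line.split(",").toList

  // the first six elements are the fixed fields,
  // everything after them belongs to the message
  val (fixed, rest) = fields.splitAt(6)

  // assumption: joining with "," restores the message text
  val msg = rest.mkString(",")

  // hypothetical helper: an empty field becomes None
  def opt(s: String): Option[String] =
    if (s.isEmpty) None else Some(s)

  // the csv order differs from the argument order of Rec
  val rec = Rec(fixed(0).toInt, fixed(1), fixed(2), msg,
                fixed(3), opt(fixed(4)), opt(fixed(5)))
\end{verbatim}

\noindent
Here the split produces eight elements rather than seven, and
\texttt{msg} is again the full text including its comma. The fields
\texttt{parent} and \texttt{children} keep their default values at this
stage.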

The main information computed in the tasks below is which countries
the authors come from and how many authors come from each
country. The last task also ranks the messages by popularity, measured
by how many replies they received (this is computed
as the number of children, grand-children and so on of a
message).
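
\noindent
For the country statistics, one possible way of counting is sketched
below on a few made-up records. It is an assumption that an author who
wrote several messages is counted only once; Task (4) asks for the
empty string when no country is given.

\begin{verbatim}
  case class Rec(num: Int, msg_id: String, date: String,
                 msg: String, author: String,
                 country: Option[String],
                 reply_id: Option[String],
                 parent: Option[Int] = None,
                 children: List[Int] = Nil)

  // made-up records, not taken from log.csv
  val rs = List(
    Rec(0, "a1", "d0", "hello", "Alice", Some("UK"), None),
    Rec(1, "a2", "d1", "hi", "Bob", None, Some("a1")),
    Rec(2, "a3", "d2", "again", "Alice", Some("UK"), Some("a2")),
    Rec(3, "a4", "d3", "hey", "Carol", Some("UK"), None))

  // one (author, country) pair per author;
  // a missing country becomes the empty string
  val authors =
    rs.map(r => (r.author, r.country.getOrElse(""))).distinct

  // group the pairs by country and count each group
  val counts: Map[String, Int] =
    authors.groupBy(_._2).map { case (c, as) => (c, as.size) }
\end{verbatim}

\noindent
In this sketch Alice is counted once although she wrote two messages,
so \texttt{counts} maps \texttt{"UK"} to 2 and the empty string to 1.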

\subsection*{Tasks}


\begin{itemize}
\item[(1)] The function \texttt{get\_csv} takes a file name as
  argument. It should read the corresponding file and return its
  content. The content should be returned as a list of strings, namely a
  string for each line in the file. Since the file is a csv-file, the
  first line (the header) should be dropped in the result. Lines are
  separated by \verb!"\n"!. For the file \texttt{log.csv} there should
  be a list of 680 separate strings.

  \mbox{}\hfill[5\% Marks]

 
\item[(2)] The function \texttt{process\_line} takes a single line
  from the csv-file (as generated by \texttt{get\_csv}) and creates a
  Rec(ord) data structure. This data structure is pre-defined in the
  Scala file.

  For processing a line, you should use the function

  \begin{center}
    \verb!<<some_line>>.split(",").toList!
  \end{center}

  \noindent
  in order to separate the fields. HOWEVER BE CAREFUL that the message
  text in the last field of \texttt{log.csv} can contain commas and
  therefore the split will not always result in a list of only 7
  elements. You need to concatenate the 7th element and anything beyond
  it into a single string before assigning the field \texttt{msg}.

  \mbox{}\hfill[10\% Marks]

\item[(3)] Each record in the log contains a unique hash code
  identifying the message. For example

  \begin{center}
  \verb!"5ebeb459ac278d01301f1497"!
  \end{center}  

  \noindent
  Some messages also contain a hash code identifying the parent
  message (that is to which question they reply).  The function
  \texttt{post\_process} fills in the information about potential
  children and a potential parent message.
  
  The auxiliary function \texttt{get\_children} takes a record
  \texttt{e} and a record list \texttt{rs} as arguments, and returns
  the list of all direct children (children have the hash code of
  \texttt{e} as \texttt{reply\_id}). The list of children is returned
  as a list of \texttt{num}s. The \texttt{num}s can be used later
  as indexes in a Rec-list.
      
  The auxiliary function \texttt{get\_parent} returns the number of
  the record corresponding to the \texttt{reply\_id} (encoded as
  \texttt{Some} if there exists one, otherwise it returns \texttt{None}).

  In order to update a record, say \texttt{r}, with some additional
  information, you can use the Scala code
  \begin{verbatim}
      r.copy(parent = ....,
             children = ....)
  \end{verbatim}

  \mbox{}\hfill[10\% Marks]
  
\item[(4)] The functions \texttt{get\_countries} and
  \texttt{get\_countries\_numbers} calculate the countries that
  message authors come from and how many authors come from each
  country (returned as a \texttt{Map} from countries to integers). In
  case an author did not specify a country, the empty string should
  be used as the country name.

  \mbox{}\hfill[10\% Marks]


\item[(5)] This task identifies the most popular questions in the log,
  whereby popularity is measured in terms of how many follow-up
  questions were asked. We say that a question together with its
  follow-up questions forms a \emph{thread}. It can be assumed that in
  \texttt{log.csv} there are no circular references, that is, no
  question has one of its own follow-up questions as parent.

  The function \texttt{ordered\_thread\_sizes} orders the
  message threads according to how many answers were given for one
  message (that is how many children, grand-children and so on one
  message has).

  The auxiliary function \texttt{search} enumerates all children,
  grand-children and so on for a given record \texttt{r} (including
  the record \texttt{r} itself). The function \texttt{search} returns
  these children as a list of \texttt{Rec}s.

  The function \texttt{thread\_size} generates for a record, say
  \texttt{r}, a pair consisting of the number of \texttt{r} and the
  number of all children as produced by \texttt{search}. The number of
  a record is the integer given to each message; for \texttt{log.csv}
  this number is between 0 and 679.

  The function \texttt{ordered\_thread\_sizes} orders the list of
  pairs according to which thread in the chat is the longest (the
  longest should be first).

\mbox{}\hfill[15\% Marks]
\end{itemize}
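
\noindent
The second processing step and the thread ranking from Tasks (3) and
(5) can be sketched on a small made-up log as follows. This is only one
possible reading of the task descriptions: the argument lists of
\texttt{get\_parent}, \texttt{search}, \texttt{thread\_size} and
\texttt{ordered\_thread\_sizes} are assumptions, as is the choice that
the size of a thread includes the record itself.

\begin{verbatim}
  case class Rec(num: Int, msg_id: String, date: String,
                 msg: String, author: String,
                 country: Option[String],
                 reply_id: Option[String],
                 parent: Option[Int] = None,
                 children: List[Int] = Nil)

  // direct children: records that reply to the hash code of e
  def get_children(e: Rec, rs: List[Rec]): List[Int] =
    rs.filter(_.reply_id.contains(e.msg_id)).map(_.num)

  // number of the record that e replies to, if there is one
  def get_parent(e: Rec, rs: List[Rec]): Option[Int] =
    e.reply_id.flatMap(id => rs.find(_.msg_id == id).map(_.num))

  // fill in parent and children for every record
  def post_process(rs: List[Rec]): List[Rec] =
    rs.map(r => r.copy(parent = get_parent(r, rs),
                       children = get_children(r, rs)))

  // r followed by all its descendants; rs must be post-processed;
  // terminates because there are no circular references
  def search(r: Rec, rs: List[Rec]): List[Rec] =
    r :: rs.filter(c => r.children.contains(c.num))
           .flatMap(search(_, rs))

  // assumption: the size counts r itself as well
  def thread_size(r: Rec, rs: List[Rec]): (Int, Int) =
    (r.num, search(r, rs).size)

  // longest thread first
  def ordered_thread_sizes(rs: List[Rec]): List[(Int, Int)] =
    rs.map(thread_size(_, rs)).sortBy(p => -p._2)

  // made-up, unordered log: 0 <- 1 <- 2, and 3 on its own
  val raw = List(
    Rec(2, "a3", "d2", "third", "Alice", Some("UK"), Some("a2")),
    Rec(0, "a1", "d0", "first", "Alice", Some("UK"), None),
    Rec(3, "a4", "d3", "other", "Carol", None, None),
    Rec(1, "a2", "d1", "second", "Bob", None, Some("a1")))

  val ranked = ordered_thread_sizes(post_process(raw))
\end{verbatim}

\noindent
In this sketch, records are looked up by their \texttt{num} rather than
by their list position, so the list does not have to be sorted. For the
made-up log the message with number 0 is ranked first with a thread
size of 3.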

\end{document}

  



%%% Local Variables: 
%%% mode: latex
%%% TeX-master: t
%%% End: