solutions-resit/cw-resit.tex
changeset 336 25d9c3b2bc99
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/solutions-resit/cw-resit.tex	Sun Aug 23 14:39:58 2020 +0100
@@ -0,0 +1,208 @@
+\documentclass{article}
+\usepackage{../style}
+\usepackage{../langs}
+
+\begin{document}
+
+\section*{August Exam (Scala):  Chat Log Mining}
+
+This coursework is worth 50\%. It is about mining a log of an online
+chat between 85 participants. The log is given as a csv-list in the file
+\texttt{log.csv}. The log is an unordered list recording which
+message was sent, by whom, when, and in response to which other
+message. Each message also has a number and a unique hash code.\bigskip
+
+\noindent 
+\textbf{Important:} Make sure the file you submit can be processed 
+by just calling
+
+\begin{center}
+  \texttt{scala <<filename.scala>>}
+\end{center}
+
+\noindent
+Do not use any mutable data structures in your
+submission! They are not needed. This means you cannot use, for example,
+\texttt{ListBuffer}s or \texttt{Array}s. Do not use
+\texttt{return} in your code! It has a different meaning in Scala
+than in Java.  Do not use \texttt{var}! This declares a mutable
+variable.  
+
+\subsection*{Disclaimer}
+
+It should be understood that the work you submit represents your own
+effort! You have not copied from anyone or anywhere else. An exception
+is the Scala code I showed during the lectures or uploaded to KEATS,
+which you can freely use.\bigskip
+
+
+\subsection*{Background}
+
+\noindent
+The fields in the file \texttt{log.csv} are organised 
+as follows:
+
+\begin{center}
+\texttt{counter, id, time\_date, name, country, parent\_id, msg}
+\end{center}  
+
+\noindent
+Each line in this file contains the data for a single message.  The field
+\texttt{counter} is an integer number given to each message; \texttt{id} is a
+unique hash string for a message; \texttt{time\_date} is the time when the message
+was sent; \texttt{name} and \texttt{country} are data about the author
+of the message, whereby sometimes the authors left the country information
+empty; \texttt{parent\_id} is a hash specifying which other message the
+message answers (this can also be empty). \texttt{msg} is the actual
+message text. \textbf{Be careful} in the tasks below: this text can contain
+commas and needs to be treated specially when the line is split up
+by using \texttt{line.split(",").toList}. Tasks (2) and (3) are about
+processing this data and storing it into the \texttt{Rec}-data-structure, which
+is pre-defined in the file \texttt{resit.scala}:
+
+
+\begin{center}
+\begin{verbatim}  
+  Rec(num: Int, 
+      msg_id: String,
+      date: String,
+      msg: String,
+      author: String,
+      country: Option[String],
+      reply_id : Option[String],
+      parent: Option[Int] = None,
+      children: List[Int] = Nil)  
+\end{verbatim}
+\end{center}
+
+\noindent
+The transformation into a Rec-data-structure is a two-step process
+where first the fields for parents and children are given default
+values. This information is then filled in in a second step.
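+
+\noindent
+The comma problem mentioned above can be seen in a small sketch. The
+log line below is made up for illustration (it is not taken from
+\texttt{log.csv}, and the date format is an assumption), and the helper
+\texttt{opt} is a hypothetical name. It is one possible way of
+recovering the message text, not the required one.

```scala
// Hypothetical log line (not from log.csv); the message itself contains commas.
val line = "7,5ebeb459ac278d01301f1497,2020-05-15T15:00:00,Alice,,,Hello, world, again"

// Naive split: the message text is broken into several pieces.
val fields = line.split(",").toList

// The first six elements are the fixed fields; everything from the
// 7th element onwards belongs to the message and is joined back with commas.
val (fixed, rest) = fields.splitAt(6)
val msg = rest.mkString(",")

// Assumed convention: an empty field becomes None, a non-empty one Some(...).
def opt(s: String): Option[String] = if (s.isEmpty) None else Some(s)
val country = opt(fixed(4))
```

+\noindent
+Here the naive split yields nine elements instead of seven, and joining
+the tail with \texttt{mkString(",")} restores the original message text.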
+
+The main information that will be computed in the tasks below is which
+countries the authors are from and how many authors are from each
+country. The last task will also rank which messages have been the most
+popular in terms of how many replies they received (this will be computed
+according to the number of children, grand-children and so on of a
+message).
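+
+\noindent
+A per-country count of this kind can be sketched on toy data as follows.
+The pairs below are invented, the names \texttt{msgs}, \texttt{authors}
+and \texttt{per\_country} are hypothetical, and counting each author only
+once is an assumed reading of ``how many authors''.

```scala
// Hypothetical (author, country) pairs, one pair per message.
val msgs = List(("Alice", Some("UK")), ("Bob", None),
                ("Alice", Some("UK")), ("Carol", Some("UK")))

// Assumption: each author is counted once; a missing country becomes "".
val authors = msgs.map { case (a, c) => (a, c.getOrElse("")) }.distinct

// Group the distinct authors by country and count each group.
val per_country: Map[String, Int] =
  authors.groupBy(_._2).map { case (c, as) => (c, as.length) }
```

+\noindent
+With this toy input, Alice is counted once although she wrote two messages.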
+
+\subsection*{Tasks}
+
+
+\begin{itemize}
+\item[(1)] The function \texttt{get\_csv} takes a file name as
+  argument. It should read the corresponding file and return its
+  content. The content should be returned as a list of strings, namely a
+  string for each line in the file. Since the file is a csv-file, the
+  first line (the header) should be dropped in the result. Lines are
+  separated by \verb!"\n"!. For the file \texttt{log.csv} there should
+  be a list of 680 separate strings.
+
+  \mbox{}\hfill[5\% Marks]
+
+ 
+\item[(2)] The function \texttt{process\_line} takes a single line
+  from the csv-file (as generated by \texttt{get\_csv}) and creates a
+  Rec(ord) data structure. This data structure is pre-defined in the
+  Scala file.
+
+  For processing a line, you should use the function
+
+  \begin{center}
+    \verb!<<some_line>>.split(",").toList!
+  \end{center}
+
+  \noindent
+  in order to separate the fields. HOWEVER BE CAREFUL that the message
+  text in the last field of \texttt{log.csv} can contain commas and
+  therefore the split will not always result in a list of only 7
+  elements. You need to concatenate the 7th element and anything beyond
+  it into a single string before assigning the field \texttt{msg}.
+
+  \mbox{}\hfill[10\% Marks]
+
+\item[(3)] Each record in the log contains a unique hash code
+  identifying the message. For example
+
+  \begin{center}
+  \verb!"5ebeb459ac278d01301f1497"!
+  \end{center}  
+
+  \noindent
+  Some messages also contain a hash code identifying the parent
+  message (that is to which question they reply).  The function
+  \texttt{post\_process} fills in the information about potential
+  children and a potential parent message.
+  
+  The auxiliary function \texttt{get\_children} takes a record
+  \texttt{e} and a record list \texttt{rs} as arguments, and returns
+  the list of all direct children (children have the hash code of
+  \texttt{e} as \texttt{reply\_id}). The list of children is returned
+  as a list of \texttt{num}s. The \texttt{num}s can be used later
+  as indexes in a Rec-list.
+      
+  The auxiliary function \texttt{get\_parent} returns the number of
+  the record corresponding to the \texttt{reply\_id} (encoded as
+  \texttt{Some} if there exists one, otherwise it returns \texttt{None}).
+
+  In order to update a record, say \texttt{r}, with some additional
+  information, you can use the Scala code
+  \begin{verbatim}
+      r.copy(parent = ....,
+             children = ....)
+  \end{verbatim}
+
+  \mbox{}\hfill[10\% Marks]
+  
+\item[(4)] The functions \texttt{get\_countries} and
+  \texttt{get\_countries\_numbers} calculate the countries where
+  message authors are coming from and how many authors come from each
+  country (returned as a \texttt{Map} from countries to Integers). In
+  case an author did not specify a country, the empty string should
+  be returned.
+
+  \mbox{}\hfill[10\% Marks]
+
+
+\item[(5)] This task identifies the most popular questions in the log,
+  whereby popularity is measured in terms of how many follow-up
+  questions were asked. We say that such questions belong to a
+  \emph{thread}. It can be assumed that in \texttt{log.csv} there are
+  no circular references, that is, no question refers to a
+  follow-up question as parent.
+
+  The function \texttt{ordered\_thread\_sizes} orders the
+  message threads according to how many answers were given for one
+  message (that is how many children, grand-children and so on one
+  message has).
+
+  The auxiliary function \texttt{search} enumerates all children,
+  grand-children and so on for a given record \texttt{r} (including
+  the record \texttt{r} itself). \texttt{search} returns these children
+  as a list of \texttt{Rec}s.
+
+  The function \texttt{thread\_size} generates for a record, say
+  \texttt{r}, a pair consisting of the number of \texttt{r} and the
+  number of all children as produced by \texttt{search}. The numbers are the
+  integers given for each message; for \texttt{log.csv} a number
+  is between 0 and 679.
+
+  The function \texttt{ordered\_thread\_sizes} orders the list of
+  pairs according to which thread in the chat is the longest (the
+  longest should be first).
+
+\mbox{}\hfill[15\% Marks]
+\end{itemize}
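+
+\noindent
+The ideas behind tasks (3) and (5) can be illustrated on a toy example.
+The case class \texttt{Node} is a simplified, hypothetical stand-in for
+\texttt{Rec}, and the helper names \texttt{children\_of},
+\texttt{parent\_of} and \texttt{descendants} are invented for this sketch.
+It assumes that \texttt{num} coincides with the position in the list and
+that there are no circular references.

```scala
// Simplified, hypothetical stand-in for Rec with only the fields needed here.
case class Node(num: Int, msg_id: String, reply_id: Option[String],
                parent: Option[Int] = None, children: List[Int] = Nil)

// Three made-up messages: 1 replies to 0, and 2 replies to 1.
val raw = List(Node(0, "a", None), Node(1, "b", Some("a")), Node(2, "c", Some("b")))

// Direct children of e: records whose reply_id is the hash of e.
def children_of(e: Node, rs: List[Node]): List[Int] =
  rs.filter(_.reply_id == Some(e.msg_id)).map(_.num)

// Number of the record that e replies to, if there is one.
def parent_of(e: Node, rs: List[Node]): Option[Int] =
  rs.find(r => e.reply_id == Some(r.msg_id)).map(_.num)

// Second step: copy builds a new record with both fields filled in.
val filled = raw.map(r =>
  r.copy(parent = parent_of(r, raw), children = children_of(r, raw)))

// r together with all its children, grand-children and so on.
// Assumes num is a valid index into rs and that there are no cycles.
def descendants(r: Node, rs: List[Node]): List[Node] =
  r :: r.children.flatMap(c => descendants(rs(c), rs))

// Pairs (num, size of the enumeration), largest first.
val sizes = filled.map(r => (r.num, descendants(r, filled).length))
                  .sortBy(p => -p._2)
```

+\noindent
+Since \texttt{copy} returns a new record and leaves the old one
+untouched, no mutable data structure is needed in this sketch.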
+
+\end{document}
+
+
+
+%%% Local Variables: 
+%%% mode: latex
+%%% TeX-master: t
+%%% End: