solutions-resit/cw-resit.tex
changeset 486 9c03b5e89a2a
parent 485 19b75e899d37
child 487 efad9725dfd8
--- a/solutions-resit/cw-resit.tex	Fri Apr 26 17:29:30 2024 +0100
+++ /dev/null	Thu Jan 01 00:00:00 1970 +0000
@@ -1,208 +0,0 @@
-\documentclass{article}
-\usepackage{../style}
-\usepackage{../langs}
-
-\begin{document}
-
-\section*{August Exam (Scala):  Chat Log Mining}
-
-This coursework is worth 50\%. It is about mining a log of an online
-chat between 85 participants. The log is given as a csv-list in the file
-\texttt{log.csv}. The log is an unordered list recording which
-message was sent, by whom, when, and in response to which other
-message. Each message also has a number and a unique hash code.\bigskip
-
-\noindent 
-\textbf{Important:} Make sure the file you submit can be processed 
-by just calling
-
-\begin{center}
-  \texttt{scala <<filename.scala>>}
-\end{center}
-
-\noindent
-Do not use any mutable data structures in your
-submission! They are not needed. This means, for example, that you
-cannot use \texttt{ListBuffer}s or \texttt{Array}s. Do not use
-\texttt{return} in your code! It has a different meaning in Scala
-than in Java.  Do not use \texttt{var}! This declares a mutable
-variable.  
-
-\subsection*{Disclaimer}
-
-It should be understood that the work you submit represents your own
-effort! You have not copied from anyone or anywhere else. An exception
-is the Scala code I showed during the lectures or uploaded to KEATS,
-which you can freely use.\bigskip
-
-
-\subsection*{Background}
-
-\noindent
-The fields in the file \texttt{log.csv} are organised 
-as follows:
-
-\begin{center}
-\texttt{counter, id, time\_date, name, country, parent\_id, msg}
-\end{center}  
-
-\noindent
-Each line in this file contains the data for a single message.  The field
-\texttt{counter} is an integer number given to each message; \texttt{id} is a
-unique hash string for a message; \texttt{time\_date} is the time when the message
-was sent; \texttt{name} and \texttt{country} are data about the author
-of the message, whereby sometimes the authors left the country information
-empty; \texttt{parent\_id} is a hash specifying which other message the
-message answers (this can also be empty); \texttt{msg} is the actual
-message text. \textbf{Be careful} in the tasks below: this text can contain
-commas and needs special treatment when the line is split up
-by using \texttt{line.split(",").toList}. Tasks (2) and (3) are about
-processing this data and storing it into the \texttt{Rec}-data-structure, which
-is pre-defined in the file \texttt{resit.scala}:
-
-
-\begin{center}
-\begin{verbatim}  
-  Rec(num: Int, 
-      msg_id: String,
-      date: String,
-      msg: String,
-      author: String,
-      country: Option[String],
-      reply_id : Option[String],
-      parent: Option[Int] = None,
-      children: List[Int] = Nil)  
-\end{verbatim}
-\end{center}
-
-\noindent
-The transformation into a Rec-data-structure is a two-step process
-where first the fields for parents and children are given default
-values. This information is then filled in in a second step.
-
-The main information computed in the tasks below is which countries
-the authors are from and how many authors are from each
-country. The last task will also rank which messages have been the most
-popular in terms of how many replies they received (this will be computed
-according to the number of children, grand-children and so on of a
-message).
-
-\subsection*{Tasks}
-
-
-\begin{itemize}
-\item[(1)] The function \texttt{get\_csv} takes a file name as
-  argument. It should read the corresponding file and return its
-  content. The content should be returned as a list of strings, namely a
-  string for each line in the file. Since the file is a csv-file, the
-  first line (the header) should be dropped in the result. Lines are
-  separated by \verb!"\n"!. For the file \texttt{log.csv} there should
-  be a list of 680 separate strings.
-
-  \mbox{}\hfill[5\% Marks]
-
- 
-\item[(2)] The function \texttt{process\_line} takes a single line
-  from the csv-file (as generated by \texttt{get\_csv}) and creates a
-  Rec(ord) data structure. This data structure is pre-defined in the
-  Scala file.
-
-  For processing a line, you should use the function
-
-  \begin{center}
-    \verb!<<some_line>>.split(",").toList!
-  \end{center}
-
-  \noindent
-  in order to separate the fields. HOWEVER, BE CAREFUL: the message
-  text in the last field of \texttt{log.csv} can contain commas and
-  therefore the split will not always result in a list of only 7
-  elements. You need to concatenate the 7th element and anything beyond
-  it into a single string before assigning the field \texttt{msg}.
-
-  \mbox{}\hfill[10\% Marks]
-
-\item[(3)] Each record in the log contains a unique hash code
-  identifying each message. For example
-
-  \begin{center}
-  \verb!"5ebeb459ac278d01301f1497"!
-  \end{center}  
-
-  \noindent
-  Some messages also contain a hash code identifying the parent
-  message (that is, the question to which they reply).  The function
-  \texttt{post\_process} fills in the information about potential
-  children and a potential parent message.
-  
-  The auxiliary function \texttt{get\_children} takes a record
-  \texttt{e} and a record list \texttt{rs} as arguments, and returns
-  the list of all direct children (children have the hash code of
-  \texttt{e} as \texttt{reply\_id}). The list of children is returned
-  as a list of \texttt{num}s. The \texttt{num}s can be used later
-  as indexes in a Rec-list.
-      
-  The auxiliary function \texttt{get\_parent} returns the number of
-  the record corresponding to the \texttt{reply\_id} (encoded as
-  \texttt{Some} if there exists one, otherwise it returns \texttt{None}).
-
-  In order to update a record, say \texttt{r}, with some additional
-  information, you can use the Scala code
-  \begin{verbatim}
-      r.copy(parent = ....,
-             children = ....)
-  \end{verbatim}
-
-  \mbox{}\hfill[10\% Marks]
-  
-\item[(4)] The functions \texttt{get\_countries} and
-  \texttt{get\_countries\_numbers} calculate the countries where
-  message authors are coming from and how many authors come from each
-  country (returned as a \texttt{Map} from countries to Integers). In
-  case an author did not specify a country, the empty string should
-  be returned.
-
-  \mbox{}\hfill[10\% Marks]
-
-
-\item[(5)] This task identifies the most popular questions in the log,
-  whereby popularity is measured in terms of how many follow-up
-  questions were asked. We say that such questions belong to a
-  \emph{thread}. It can be assumed that in \texttt{log.csv} there are
-  no circular references, that is, no question refers to a
-  follow-up question as parent.
-
-  The function \texttt{ordered\_thread\_sizes} orders the
-  message threads according to how many answers were given for one
-  message (that is how many children, grand-children and so on one
-  message has).
-
-  The auxiliary function \texttt{search} enumerates all children,
-  grand-children and so on for a given record \texttt{r} (including
-  the record \texttt{r} itself). The function \texttt{search} returns
-  these children as a list of \texttt{Rec}s.
-
-  The function \texttt{thread\_size} generates for a record, say
-  \texttt{r}, a pair consisting of the number of \texttt{r} and the
-  number of all children as produced by \texttt{search}. The numbers are the
-  integers given for each message---for \texttt{log.csv} a number
-  is between 0 and 679.
-
-  The function \texttt{ordered\_thread\_sizes} orders the list of
-  pairs according to which thread in the chat is the longest (the
-  longest should be first).
-
-\mbox{}\hfill[15\% Marks]
-\end{itemize}
-
-\end{document}
-
-  
-
-\end{document}
-
-
-%%% Local Variables: 
-%%% mode: latex
-%%% TeX-master: t
-%%% End: