--- a/solutions-resit/cw-resit.tex Fri Apr 26 17:29:30 2024 +0100
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000
@@ -1,208 +0,0 @@
-\documentclass{article}
-\usepackage{../style}
-\usepackage{../langs}
-
-\begin{document}
-
-\section*{August Exam (Scala): Chat Log Mining}
-
-This coursework is worth 50\%. It is about mining a log of an online
-chat between 85 participants. The log is given as a csv-list in the file
-\texttt{log.csv}. The log is an unordered list that records which
-message was sent, by whom, when, and in response to which other
-message. Each message also has a number and a unique hash code.\bigskip
-
-\noindent
-\textbf{Important:} Make sure the file you submit can be processed
-by just calling
-
-\begin{center}
- \texttt{scala <<filename.scala>>}
-\end{center}
-
-\noindent
-Do not use any mutable data structures in your
-submission! They are not needed. This means you cannot use, for example,
-\texttt{ListBuffer}s or \texttt{Array}s. Do not use
-\texttt{return} in your code! It has a different meaning in Scala
-than in Java. Do not use \texttt{var}! This declares a mutable
-variable.
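As an illustration of the restrictions above, the following sketch counts word occurrences without \texttt{var}, \texttt{return} or mutable collections. The function \texttt{count\_words} is a made-up example and not part of the coursework template; the accumulator is passed through \texttt{foldLeft} instead of being updated in place.

```scala
// Hypothetical helper, not part of the coursework template.
// Counts how often each word occurs, using only immutable values:
// every step of the fold builds a new Map from the previous one.
def count_words(ws: List[String]): Map[String, Int] =
  ws.foldLeft(Map.empty[String, Int]) { (acc, w) =>
    acc.updated(w, acc.getOrElse(w, 0) + 1)
  }
```

For instance, \texttt{count\_words(List("a", "b", "a"))} evaluates to a map in which \texttt{a} is mapped to 2 and \texttt{b} to 1.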
-
-\subsection*{Disclaimer}
-
-It should be understood that the work you submit represents your own
-effort! You have not copied from anyone or anywhere else. An exception
-is the Scala code I showed during the lectures or uploaded to KEATS,
-which you can freely use.\bigskip
-
-
-\subsection*{Background}
-
-\noindent
-The fields in the file \texttt{log.csv} are organised
-as follows:
-
-\begin{center}
-\texttt{counter, id, time\_date, name, country, parent\_id, msg}
-\end{center}
-
-\noindent
-Each line in this file contains the data for a single message. The field
-\texttt{counter} is an integer number given to each message; \texttt{id} is a
-unique hash string for a message; \texttt{time\_date} is the time when the message
-was sent; \texttt{name} and \texttt{country} are data about the author
-of the message, whereby authors sometimes left the country information
-empty; \texttt{parent\_id} is a hash specifying which other message the
-message answers (this can also be empty); \texttt{msg} is the actual
-message text. \textbf{Be careful} in the tasks below: this text can contain
-commas and needs to be treated specially when the line is split up
-by using \texttt{line.split(",").toList}. Tasks (2) and (3) are about
-processing this data and storing it in the \texttt{Rec} data structure, which
-is pre-defined in the file \texttt{resit.scala}:
-
-
-\begin{center}
-\begin{verbatim}
- Rec(num: Int,
- msg_id: String,
- date: String,
- msg: String,
- author: String,
- country: Option[String],
- reply_id : Option[String],
- parent: Option[Int] = None,
- children: List[Int] = Nil)
-\end{verbatim}
-\end{center}
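A minimal sketch of how a line might be turned into a \texttt{Rec}, including the comma problem in the message text. It assumes that \texttt{Rec} is a case class with exactly the fields shown above, that the message text is non-empty (so the split yields at least 7 elements), and that the pieces of the message are re-joined with commas. The helper \texttt{opt} and the sample line are invented for illustration and are not taken from \texttt{log.csv}.

```scala
// Assumed to match the pre-defined structure in resit.scala.
case class Rec(num: Int,
               msg_id: String,
               date: String,
               msg: String,
               author: String,
               country: Option[String],
               reply_id: Option[String],
               parent: Option[Int] = None,
               children: List[Int] = Nil)

// Hypothetical helper: an empty field becomes None.
def opt(s: String): Option[String] = if (s.isEmpty) None else Some(s)

// Sketch of process_line; assumes a numeric counter and a non-empty message.
def process_line(line: String): Rec = {
  val fs = line.split(",").toList
  Rec(num = fs(0).toInt,
      msg_id = fs(1),
      date = fs(2),
      msg = fs.drop(6).mkString(","),  // re-join the 7th element and the rest
      author = fs(3),
      country = opt(fs(4)),
      reply_id = opt(fs(5)))
}

// Invented line in the order counter, id, time_date, name, country, parent_id, msg.
val sample = "3,abc123,2020-05-15 17:00,Alice,,,Hello, world, again"
val rec = process_line(sample)
```

Here the split produces 9 elements rather than 7, and \texttt{rec.msg} is the full text \texttt{Hello, world, again}. The fields \texttt{parent} and \texttt{children} keep their default values at this stage.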
-
-\noindent
-The transformation into a Rec-data-structure is a two-step process
-where first the fields for parents and children are given default
-values. This information is then filled in in a second step.
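The second step can be sketched as follows. The function names are those used in task (3), but the signatures and the three sample records are assumptions made for this illustration; \texttt{Rec} is again assumed to be a case class with the fields listed above.

```scala
// Assumed to match the pre-defined structure in resit.scala.
case class Rec(num: Int, msg_id: String, date: String, msg: String,
               author: String, country: Option[String],
               reply_id: Option[String],
               parent: Option[Int] = None, children: List[Int] = Nil)

// Direct children of e: all records whose reply_id is the hash code of e.
def get_children(e: Rec, rs: List[Rec]): List[Int] =
  rs.filter(_.reply_id.contains(e.msg_id)).map(_.num)

// Number of the record that e replies to, if such a record exists.
def get_parent(e: Rec, rs: List[Rec]): Option[Int] =
  e.reply_id.flatMap(id => rs.find(_.msg_id == id)).map(_.num)

// Fills in parent and children for every record (assumed signature).
def post_process(rs: List[Rec]): List[Rec] =
  rs.map(r => r.copy(parent = get_parent(r, rs),
                     children = get_children(r, rs)))

// Invented records: messages 1 and 2 both reply to message 0.
val step1 = List(
  Rec(0, "h0", "d0", "question", "Alice", None, None),
  Rec(1, "h1", "d1", "answer", "Bob", Some("UK"), Some("h0")),
  Rec(2, "h2", "d2", "answer", "Carol", None, Some("h0")))
val step2 = post_process(step1)
```

After the second step, record 0 has the children \texttt{List(1, 2)} and no parent, while records 1 and 2 have \texttt{Some(0)} as parent and no children.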
-
-The main information computed in the tasks below is which
-countries the authors are from and how many authors are from each
-country. The last task also ranks messages by popularity, measured by
-how many replies they received (this is computed as
-the number of children, grand-children and so on of a
-message).
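The ranking by replies can be sketched as below. The function names follow task (5), but the signatures are assumptions. The sketch also assumes that the record list is ordered so that the record with number \texttt{n} sits at index \texttt{n}, that \texttt{children} has already been filled in, and that there are no circular references. Since \texttt{search} includes the record itself, the size computed here counts the record plus all its descendants. The five sample records are invented.

```scala
// Assumed to match the pre-defined structure in resit.scala.
case class Rec(num: Int, msg_id: String, date: String, msg: String,
               author: String, country: Option[String],
               reply_id: Option[String],
               parent: Option[Int] = None, children: List[Int] = Nil)

// r itself followed by its children, grand-children and so on.
// Assumes rs(n) is the record with number n and that there are no cycles.
def search(r: Rec, rs: List[Rec]): List[Rec] =
  r :: r.children.flatMap(c => search(rs(c), rs))

// Pair of the record's number and the number of records found by search.
def thread_size(r: Rec, rs: List[Rec]): (Int, Int) =
  (r.num, search(r, rs).length)

// Longest thread first; sortBy is stable, so ties keep their list order.
def ordered_thread_sizes(rs: List[Rec]): List[(Int, Int)] =
  rs.map(r => thread_size(r, rs)).sortBy(p => -p._2)

// Invented records: 1 and 2 reply to 0, 3 replies to 1, 4 stands alone.
val recs = List(
  Rec(0, "h0", "d", "m", "A", None, None, None, List(1, 2)),
  Rec(1, "h1", "d", "m", "B", None, Some("h0"), Some(0), List(3)),
  Rec(2, "h2", "d", "m", "C", None, Some("h0"), Some(0), Nil),
  Rec(3, "h3", "d", "m", "D", None, Some("h1"), Some(1), Nil),
  Rec(4, "h4", "d", "m", "E", None, None, None, Nil))
```

For these records, \texttt{search(recs(0), recs)} visits the messages 0, 1, 3 and 2 in this order, and \texttt{ordered\_thread\_sizes(recs)} starts with the pair \texttt{(0, 4)}.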
-
-\subsection*{Tasks}
-
-
-\begin{itemize}
-\item[(1)] The function \texttt{get\_csv} takes a file name as
- argument. It should read the corresponding file and return its
- content. The content should be returned as a list of strings, namely a
- string for each line in the file. Since the file is a csv-file, the
- first line (the header) should be dropped in the result. Lines are
- separated by \verb!"\n"!. For the file \texttt{log.csv} there should
- be a list of 680 separate strings.
-
- \mbox{}\hfill[5\% Marks]
-
-
-\item[(2)] The function \texttt{process\_line} takes a single line
- from the csv-file (as generated by \texttt{get\_csv}) and creates a
- Rec(ord) data structure. This data structure is pre-defined in the
- Scala file.
-
- For processing a line, you should use the function
-
- \begin{center}
- \verb!<<some_line>>.split(",").toList!
- \end{center}
-
- \noindent
- in order to separate the fields. HOWEVER, BE CAREFUL: the message
- text in the last field of \texttt{log.csv} can contain commas and
- therefore the split will not always result in a list of only 7
- elements. You need to concatenate the 7th element and anything beyond
- it into a single string before assigning the field \texttt{msg}.
-
- \mbox{}\hfill[10\% Marks]
-
-\item[(3)] Each record in the log contains a unique hash code
- identifying each message. For example
-
- \begin{center}
- \verb!"5ebeb459ac278d01301f1497"!
- \end{center}
-
- \noindent
- Some messages also contain a hash code identifying the parent
- message (that is to which question they reply). The function
- \texttt{post\_process} fills in the information about potential
- children and a potential parent message.
-
- The auxiliary function \texttt{get\_children} takes a record
- \texttt{e} and a record list \texttt{rs} as arguments, and returns
- the list of all direct children (children have the hash code of
- \texttt{e} as \texttt{reply\_id}). The list of children is returned
- as a list of \texttt{num}s. The \texttt{num}s can be used later
- as indexes in a Rec-list.
-
- The auxiliary function \texttt{get\_parent} returns the number of
- the record corresponding to the \texttt{reply\_id} (encoded as
- \texttt{Some} if there exists one, otherwise it returns \texttt{None}).
-
- In order to update a record, say \texttt{r}, with some additional
- information, you can use the Scala code
- \begin{verbatim}
- r.copy(parent = ....,
- children = ....)
- \end{verbatim}
-
- \mbox{}\hfill[10\% Marks]
-
-\item[(4)] The functions \texttt{get\_countries} and
- \texttt{get\_countries\_numbers} calculate which countries the
- message authors come from and how many authors come from each
- country (returned as a \texttt{Map} from countries to integers). In
- case an author did not specify a country, the empty string should
- be returned.
-
- \mbox{}\hfill[10\% Marks]
-
-
-\item[(5)] This task identifies the most popular questions in the log,
- whereby popularity is measured in terms of how many follow-up
- questions were asked. We say that a question and its follow-up questions belong to a
- \emph{thread}. It can be assumed that in \texttt{log.csv} there are
- no circular references, that is, no question refers to a
- follow-up question as parent.
-
- The function \texttt{ordered\_thread\_sizes} orders the
- message threads according to how many answers were given for one
- message (that is how many children, grand-children and so on one
- message has).
-
- The auxiliary function \texttt{search} enumerates all children,
- grand-children and so on for a given record \texttt{r} (including
- the record \texttt{r} itself). \texttt{search} returns these records
- as a list of \texttt{Rec}s.
-
- The function \texttt{thread\_size} generates for a record, say
- \texttt{r}, a pair consisting of the number of \texttt{r} and the
- number of all children as produced by \texttt{search}. The numbers are the
- integers given for each message; for \texttt{log.csv} a number
- is between 0 and 679.
-
- The function \texttt{ordered\_thread\_sizes} orders the list of
- pairs according to which thread in the chat is the longest (the
- longest should be first).
-
-\mbox{}\hfill[15\% Marks]
-\end{itemize}
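Tasks (1) and (4) could be approached along the following lines. This is only a sketch under several assumptions: the file is read as UTF-8, \texttt{get\_countries} returns a \texttt{Set}, and authors are told apart by their \texttt{author} name, so that an author who posts several messages is counted once. None of these choices is fixed by the task description, and the sample records are invented.

```scala
import scala.io.Source
import scala.util.Using

// Assumed to match the pre-defined structure in resit.scala.
case class Rec(num: Int, msg_id: String, date: String, msg: String,
               author: String, country: Option[String],
               reply_id: Option[String],
               parent: Option[Int] = None, children: List[Int] = Nil)

// Task (1): all lines of the file except the header line.
// Using.resource closes the file after reading (UTF-8 is an assumption).
def get_csv(file: String): List[String] =
  Using.resource(Source.fromFile(file, "UTF-8")) { src =>
    src.mkString.split("\n").toList.drop(1)
  }

// Task (4): countries of all authors; a missing country becomes "".
def get_countries(rs: List[Rec]): Set[String] =
  rs.map(_.country.getOrElse("")).toSet

// Task (4): number of distinct authors per country.
def get_countries_numbers(rs: List[Rec]): Map[String, Int] =
  rs.map(r => (r.author, r.country.getOrElse("")))
    .distinct
    .groupBy(_._2)
    .map { case (c, ps) => (c, ps.length) }

// Invented records: Alice posts twice, Bob gives no country.
val sample_recs = List(
  Rec(0, "h0", "d", "m", "Alice", Some("UK"), None),
  Rec(1, "h1", "d", "m", "Bob", None, None),
  Rec(2, "h2", "d", "m", "Alice", Some("UK"), None),
  Rec(3, "h3", "d", "m", "Carol", Some("UK"), None))
```

For the sample records, \texttt{get\_countries\_numbers(sample\_recs)} maps \texttt{UK} to 2 and the empty string to 1.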
-
-\end{document}
-
-
-%%% Local Variables:
-%%% mode: latex
-%%% TeX-master: t
-%%% End: