% changeset 486 9c03b5e89a2a
% parent 485 19b75e899d37
% child 487 efad9725dfd8
\documentclass{article}
\usepackage{../style}
\usepackage{../langs}

\begin{document}

\section*{August Exam (Scala): Chat Log Mining}

This coursework is worth 50\%. It is about mining the log of an online
chat between 85 participants. The log is given as a CSV list in the file
\texttt{log.csv}. The log is an unordered list recording which
message was sent, by whom, when, and in response to which other
message. Each message also has a number and a unique hash code.\bigskip

\noindent
\textbf{Important:} Make sure the file you submit can be processed
by just calling

\begin{center}
  \texttt{scala <<filename.scala>>}
\end{center}

\noindent
Do not use any mutable data structures in your
submission! They are not needed. This means you cannot use, for
example, \texttt{ListBuffer}s or \texttt{Array}s. Do not use
\texttt{return} in your code! It has a different meaning in Scala
than in Java. Do not use \texttt{var}! This declares a mutable
variable.
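
\noindent
As a rough illustration of this style (the helper names
\texttt{total\_length} and \texttt{total\_length\_rec} are invented for
this sketch and are not part of the submission template), a computation
that might otherwise update a \texttt{var} in a loop can usually be
written as a fold or as a recursive function:

\begin{verbatim}
// Hypothetical helpers, not part of the coursework template.
// Sums the lengths of all strings without var, return or
// mutable collections.
def total_length(xs: List[String]): Int =
  xs.foldLeft(0)((acc, x) => acc + x.length)

// The same computation written as a recursive function.
def total_length_rec(xs: List[String]): Int = xs match {
  case Nil => 0
  case y :: ys => y.length + total_length_rec(ys)
}

println(total_length(List("ab", "cde")))      // 5
println(total_length_rec(List("ab", "cde")))  // 5
\end{verbatim}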

\subsection*{Disclaimer}

It should be understood that the work you submit represents your own
effort! You have not copied from anyone or anywhere else. An exception
is the Scala code I showed during the lectures or uploaded to KEATS,
which you can freely use.\bigskip

\subsection*{Background}

\noindent
The fields in the file \texttt{log.csv} are organised
as follows:

\begin{center}
\texttt{counter, id, time\_date, name, country, parent\_id, msg}
\end{center}

\noindent
Each line in this file contains the data for a single message. The field
\texttt{counter} is an integer number given to each message; \texttt{id} is a
unique hash string for a message; \texttt{time\_date} is the time when the message
was sent; \texttt{name} and \texttt{country} are data about the author
of the message, where some authors left the country information
empty; \texttt{parent\_id} is a hash specifying which other message the
message answers (this can also be empty); \texttt{msg} is the actual
message text. \textbf{Be careful} in the tasks below: this text can contain
commas and needs special treatment when the line is split up
by using \texttt{line.split(",").toList}. Tasks (2) and (3) are about
processing this data and storing it in the \texttt{Rec} data structure, which
is pre-defined in the file \texttt{resit.scala}:

\begin{center}
\begin{verbatim}
  Rec(num: Int,
      msg_id: String,
      date: String,
      msg: String,
      author: String,
      country: Option[String],
      reply_id : Option[String],
      parent: Option[Int] = None,
      children: List[Int] = Nil)
\end{verbatim}
\end{center}

\noindent
The transformation into a \texttt{Rec} data structure is a two-step
process: first the fields for parents and children are given default
values; this information is then filled in in a second step.
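
\noindent
A minimal sketch of the two steps, assuming that \texttt{Rec} is a case
class with exactly the fields listed above (the sample values, including
the ids and the date format, are invented for illustration and do not
come from \texttt{log.csv}):

\begin{verbatim}
// Assumption: Rec is a case class with the fields shown above.
case class Rec(num: Int,
               msg_id: String,
               date: String,
               msg: String,
               author: String,
               country: Option[String],
               reply_id: Option[String],
               parent: Option[Int] = None,
               children: List[Int] = Nil)

// Step 1: parent and children keep their default values.
val r1 = Rec(1, "b2", "2020-05-15", "Hi, all", "Bob",
             Some("UK"), Some("a1"))

// Step 2: copy creates a new record with the two fields
// filled in; r1 itself is not modified.
val r2 = r1.copy(parent = Some(0), children = List(2, 3))

println(r1.parent)    // None
println(r2.children)  // List(2, 3)
\end{verbatim}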

The main information computed in the tasks below is which countries
the authors are from and how many authors are from each
country. The last task also ranks messages by popularity, measured by
how many replies they received (this is computed as the number of
children, grand-children and so on of a message).
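
\noindent
To make these two computations concrete, here is a small sketch on
invented data. It does not use \texttt{Rec}; the names \texttt{posts},
\texttt{kids} and \texttt{descendants} are hypothetical and chosen for
this example only. Counting each author once per country is also an
assumption of the sketch.

\begin{verbatim}
// Invented toy data: (author, country) pairs, one per message.
val posts: List[(String, Option[String])] =
  List(("Ann", Some("UK")), ("Bob", None),
       ("Ann", Some("UK")), ("Dee", Some("UK")))

// Distinct authors per country; a missing country becomes "".
val by_country = posts.distinct.groupBy(_._2.getOrElse(""))
val per_country = by_country.map { case (c, ps) => (c, ps.size) }

println(per_country("UK"))  // 2
println(per_country(""))    // 1

// Invented toy data: message number -> numbers of direct replies.
val kids: Map[Int, List[Int]] =
  Map(0 -> List(1, 2), 1 -> List(3), 2 -> Nil, 3 -> Nil)

// All children, grand-children and so on of message n
// (n itself is not included). This terminates because the
// toy data contains no circular references.
def descendants(n: Int): List[Int] = {
  val cs = kids.getOrElse(n, Nil)
  cs ::: cs.flatMap(descendants)
}

println(descendants(0))       // List(1, 2, 3)
println(descendants(2).size)  // 0
\end{verbatim}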

\subsection*{Tasks}


\begin{itemize}
\item[(1)] The function \texttt{get\_csv} takes a file name as
  argument. It should read the corresponding file and return its
  content as a list of strings, with one string for each line in the
  file. Since the file is a CSV file, the first line (the header)
  should be dropped from the result. Lines are
  separated by \verb!"\n"!. For the file \texttt{log.csv} the result
  should be a list of 680 separate strings.

  \mbox{}\hfill[5\% Marks]


\item[(2)] The function \texttt{process\_line} takes a single line
  from the CSV file (as generated by \texttt{get\_csv}) and creates a
  \texttt{Rec}(ord) data structure. This data structure is pre-defined
  in the Scala file.

  For processing a line, you should use the function

  \begin{center}
    \verb!<<some_line>>.split(",").toList!
  \end{center}

  \noindent
  in order to separate the fields. However, \textbf{be careful}: the message
  text in the last field of \texttt{log.csv} can contain commas, and
  therefore the split will not always result in a list of only 7
  elements. You need to concatenate the 7th element and everything
  after it into a single string before assigning the field \texttt{msg}.

  \mbox{}\hfill[10\% Marks]

\item[(3)] Each record in the log contains a unique hash code
  identifying the message. For example

  \begin{center}
  \verb!"5ebeb459ac278d01301f1497"!
  \end{center}

  \noindent
  Some messages also contain a hash code identifying the parent
  message (that is, the message they reply to). The function
  \texttt{post\_process} fills in the information about potential
  children and a potential parent message.

  The auxiliary function \texttt{get\_children} takes a record
  \texttt{e} and a record list \texttt{rs} as arguments, and returns
  the list of all direct children (children have the hash code of
  \texttt{e} as \texttt{reply\_id}). The list of children is returned
  as a list of \texttt{num}s. The \texttt{num}s can be used later
  as indexes into a \texttt{Rec} list.

  The auxiliary function \texttt{get\_parent} returns the number of
  the record corresponding to the \texttt{reply\_id} (wrapped in
  \texttt{Some} if such a record exists; otherwise it returns \texttt{None}).

  In order to update a record, say \texttt{r}, with some additional
  information, you can use the Scala code
  \begin{verbatim}
      r.copy(parent = ....,
             children = ....)
  \end{verbatim}

  \mbox{}\hfill[10\% Marks]

\item[(4)] The functions \texttt{get\_countries} and
  \texttt{get\_countries\_numbers} calculate the countries the
  message authors come from and how many authors come from each
  country (returned as a \texttt{Map} from countries to integers). In
  case an author did not specify a country, the empty string should
  be returned.

  \mbox{}\hfill[10\% Marks]


\item[(5)] This task identifies the most popular questions in the log,
  where popularity is measured in terms of how many follow-up
  questions were asked. We say that such questions belong to a
  \emph{thread}. It can be assumed that in \texttt{log.csv} there are
  no circular references, that is, no question refers to a
  follow-up question as parent.

  The function \texttt{ordered\_thread\_sizes} orders the
  message threads according to how many answers were given for one
  message (that is, how many children, grand-children and so on one
  message has).

  The auxiliary function \texttt{search} enumerates all children,
  grand-children and so on for a given record \texttt{r} (including
  the record \texttt{r} itself). The function \texttt{search} returns
  these children as a list of \texttt{Rec}s.

  The function \texttt{thread\_size} generates for a record, say
  \texttt{r}, a pair consisting of the number of \texttt{r} and the
  number of all children as produced by \texttt{search}. The numbers are the
  integers given for each message; for \texttt{log.csv} a number
  is between 0 and 679.

  The function \texttt{ordered\_thread\_sizes} orders the list of
  pairs according to which thread in the chat is the longest (the
  longest should be first).

\mbox{}\hfill[15\% Marks]
\end{itemize}
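
\noindent
The warning about commas in task (2) can be seen on a small example.
The sketch below uses an invented line (the real lines in
\texttt{log.csv} may look different), and re-inserting a comma between
the pieces of the message is an assumption of this sketch rather than a
stated requirement:

\begin{verbatim}
// Invented sample line; the real log.csv may look different.
val line = "7,abc123,2020-05-15,Ann,UK,,Hello, world, again"

val fields = line.split(",").toList
println(fields.length)  // 9, not 7

// Everything from the 7th element on belongs to the message
// text. Joining the pieces with "," is an assumption here.
val msg = fields.drop(6).mkString(",")
println(msg)            // Hello, world, again

// An empty field can be turned into an Option like this.
val parent_id = if (fields(5).isEmpty) None else Some(fields(5))
println(parent_id)      // None
\end{verbatim}

\noindent
Note that \texttt{split} does not include trailing empty strings in its
result, so a line that ends in one or more commas yields fewer elements
than it has fields.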

\end{document}

%%% Local Variables:
%%% mode: latex
%%% TeX-master: t
%%% End: