solutions-resit/cw-resit.tex
changeset 336 25d9c3b2bc99
335:7e00d2b13b04 336:25d9c3b2bc99
       
     1 \documentclass{article}
       
     2 \usepackage{../style}
       
     3 \usepackage{../langs}
       
     4 
       
     5 \begin{document}
       
     6 
       
     7 \section*{August Exam (Scala):  Chat Log Mining}
       
     8 
       
     9 This coursework is worth 50\%. It is about mining a log of an online
       
    10 chat between 85 participants. The log is given as a csv-list in the file
       
    11 \texttt{log.csv}. The log is an unordered list recording which
       
    12 message was sent, by whom, when, and in response to which other
       
    13 message. Each message also has a number and a unique hash code.\bigskip
       
    14 
       
    15 \noindent 
       
    16 \textbf{Important:} Make sure the file you submit can be processed 
       
    17 by just calling
       
    18 
       
    19 \begin{center}
       
    20   \texttt{scala <<filename.scala>>}
       
    21 \end{center}
       
    22 
       
    23 \noindent
       
    24 Do not use any mutable data structures in your
       
    25 submission! They are not needed. This means you cannot use, for
       
    26 example, \texttt{ListBuffer}s or \texttt{Array}s. Do not use
       
    27 \texttt{return} in your code! It has a different meaning in Scala
       
    28 than in Java.  Do not use \texttt{var}! This declares a mutable
       
    29 variable.  
       
    30 
       
    31 \subsection*{Disclaimer}
       
    32 
       
    33 It should be understood that the work you submit represents your own
       
    34 effort! You have not copied from anyone or anywhere else. An exception
       
    35 is the Scala code I showed during the lectures or uploaded to KEATS,
       
    36 which you can freely use.\bigskip
       
    37 
       
    38 
       
    39 \subsection*{Background}
       
    40 
       
    41 \noindent
       
    42 The fields in the file \texttt{log.csv} are organised 
       
    43 as follows:
       
    44 
       
    45 \begin{center}
       
    46 \texttt{counter, id, time\_date, name, country, parent\_id, msg}
       
    47 \end{center}  
       
    48 
       
    49 \noindent
       
    50 Each line in this file contains the data for a single message.  The field
       
    51 \texttt{counter} is an integer number given to each message; \texttt{id} is a
       
    52 unique hash string for a message; \texttt{time\_date} is the time when the message
       
    53 was sent; \texttt{name} and \texttt{country} are data about the author
       
    54 of the message, whereby sometimes the authors left the country information
       
    55 empty; \texttt{parent\_id} is a hash specifying which other message the
       
    56 message answers (this can also be empty); \texttt{msg} is the actual
       
    57 message text. \textbf{Be careful} in the tasks below: this text can contain
       
    58 commas and needs to be treated specially when the line is split up
       
    59 by using \texttt{line.split(",").toList}. Tasks (2) and (3) are about
       
    60 processing this data and storing it into the \texttt{Rec}-data-structure, which
       
    61 is pre-defined in the file \texttt{resit.scala}:
       
    62 
       
    63 
       
    64 \begin{center}
       
    65 \begin{verbatim}  
       
    66   Rec(num: Int, 
       
    67       msg_id: String,
       
    68       date: String,
       
    69       msg: String,
       
    70       author: String,
       
    71       country: Option[String],
       
    72       reply_id : Option[String],
       
    73       parent: Option[Int] = None,
       
    74       children: List[Int] = Nil)  
       
    75 \end{verbatim}
       
    76 \end{center}
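
\noindent
As a rough illustration of how a line could be turned into such a record, here is a minimal sketch. The case class below only mirrors the fields listed above (the actual definition is the one in \texttt{resit.scala}), the helper \texttt{opt} and the sample line are made up for illustration, and re-joining the message pieces with commas is an assumption about the intended result.

```scala
// Sketch only: Rec mirrors the fields listed above; the real definition
// lives in resit.scala and may differ in detail.
case class Rec(num: Int,
               msg_id: String,
               date: String,
               msg: String,
               author: String,
               country: Option[String],
               reply_id: Option[String],
               parent: Option[Int] = None,
               children: List[Int] = Nil)

// Hypothetical helper: an empty csv field becomes None.
def opt(s: String): Option[String] = if (s.isEmpty) None else Some(s)

// Assumes the line starts with the six fields
// counter, id, time_date, name, country, parent_id
// and that everything after them belongs to the message text.
def process_line(line: String): Rec = {
  val (front, rest) = line.split(",").toList.splitAt(6)
  front match {
    case List(counter, id, date, name, country, parent_id) =>
      // re-join the pieces of the message text that split(",") cut apart
      Rec(counter.toInt, id, date, rest.mkString(","), name,
          opt(country), opt(parent_id))
    case _ => throw new IllegalArgumentException("malformed line: " + line)
  }
}

// Made-up example line (not taken from log.csv).
val sample = process_line("3,abc,2020-05-15,Alice,,def,Hi, how are you?")
```

In this sketch the empty country field becomes \texttt{None}, and the two pieces of the message text are joined back into one string.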
       
    77 
       
    78 \noindent
       
    79 The transformation into a Rec-data-structure is a two-step process
       
    80 where first the fields for parents and children are given default
       
    81 values. This information is then filled in in a second step.
       
    82 
       
    83 The main information computed in the tasks below is which
       
    84 countries the authors are from and how many authors are from each
       
    85 country. The last task will also rank which messages have been the most
       
    86 popular in terms of how many replies they received (this will be computed
       
    87 as the number of children, grand-children and so on of a
       
    88 message).
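
\noindent
The counting of authors per country can be pictured with a small sketch. The record type \texttt{Author}, the function name \texttt{countries\_numbers} and the data are all hypothetical. The sketch assumes that an author is identified by name and counted once, and that a missing country is represented by the empty string.

```scala
// Cut-down, hypothetical record type: only the fields needed here.
case class Author(name: String, country: Option[String])

// Made-up data, not taken from log.csv.
val msgs = List(
  Author("Alice", Some("UK")),
  Author("Alice", Some("UK")),   // same author, second message
  Author("Bob", None),           // country left empty
  Author("Carol", Some("UK")))

// Assumption: an author is identified by name and counted once;
// a missing country is represented by the empty string.
def countries_numbers(rs: List[Author]): Map[String, Int] =
  rs.map(r => (r.name, r.country.getOrElse("")))
    .distinct
    .groupBy(_._2)
    .map { case (c, as) => (c, as.length) }

val counts = countries_numbers(msgs)
```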
       
    89 
       
    90 \subsection*{Tasks}
       
    91 
       
    92 
       
    93 \begin{itemize}
       
    94 \item[(1)] The function \texttt{get\_csv} takes a file name as
       
    95   argument. It should read the corresponding file and return its
       
    96   content. The content should be returned as a list of strings, namely a
       
    97   string for each line in the file. Since the file is a csv-file, the
       
    98   first line (the header) should be dropped in the result. Lines are
       
    99   separated by \verb!"\n"!. For the file \texttt{log.csv} there should
       
   100   be a list of 680 separate strings.
       
   101 
       
   102   \mbox{}\hfill[5\% Marks]
       
   103 
       
   104  
       
   105 \item[(2)] The function \texttt{process\_line} takes a single line
       
   106   from the csv-file (as generated by \texttt{get\_csv}) and creates a
       
   107   Rec(ord) data structure. This data structure is pre-defined in the
       
   108   Scala file.
       
   109 
       
   110   For processing a line, you should use the function
       
   111 
       
   112   \begin{center}
       
   113     \verb!<<some_line>>.split(",").toList!
       
   114   \end{center}
       
   115 
       
   116   \noindent
       
   117   in order to separate the fields. \textbf{However, be careful}: the message
       
   118   text in the last field of \texttt{log.csv} can contain commas and
       
   119   therefore the split will not always result in a list of only 7
       
   120   elements. You need to concatenate the 7th element and anything beyond it into
       
   121   a single string before assigning the field \texttt{msg}.
       
   122 
       
   123   \mbox{}\hfill[10\% Marks]
       
   124 
       
   125 \item[(3)] Each record in the log contains a unique hash code
       
   126   identifying each message. For example
       
   127 
       
   128   \begin{center}
       
   129   \verb!"5ebeb459ac278d01301f1497"!
       
   130   \end{center}  
       
   131 
       
   132   \noindent
       
   133   Some messages also contain a hash code identifying the parent
       
   134   message (that is to which question they reply).  The function
       
   135   \texttt{post\_process} fills in the information about potential
       
   136   children and a potential parent message.
       
   137   
       
   138   The auxiliary function \texttt{get\_children} takes a record
       
   139   \texttt{e} and a record list \texttt{rs} as arguments, and returns
       
   140   the list of all direct children (children have the hash code of
       
   141   \texttt{e} as \texttt{reply\_id}). The list of children is returned
       
   142   as a list of \texttt{num}s. The \texttt{num}s can be used later
       
   143   as indexes in a Rec-list.
       
   144       
       
   145   The auxiliary function \texttt{get\_parent} returns the number of
       
   146   the record corresponding to the \texttt{reply\_id} (encoded as
       
   147   \texttt{Some} if there exists one, otherwise it returns \texttt{None}).
       
   148 
       
   149   In order to update a record, say \texttt{r}, with some additional
       
   150   information, you can use the Scala code
       
   151   \begin{verbatim}
       
   152       r.copy(parent = ....,
       
   153              children = ....)
       
   154   \end{verbatim}
       
   155 
       
   156   \mbox{}\hfill[10\% Marks]
       
   157   
       
   158 \item[(4)] The functions \texttt{get\_countries} and
       
   159   \texttt{get\_countries\_numbers} calculate the countries where
       
   160   message authors are coming from and how many authors come from each
       
   161   country (returned as a \texttt{Map} from countries to Integers). In
       
   162   case an author did not specify a country, the empty string should
       
   163   be returned.
       
   164 
       
   165   \mbox{}\hfill[10\% Marks]
       
   166 
       
   167 
       
   168 \item[(5)] This task identifies the most popular questions in the log,
       
   169   whereby popularity is measured in terms of how many follow-up
       
   170   questions were asked. We say that such questions belong to a
       
   171   \emph{thread}. It can be assumed that in \texttt{log.csv} there are
       
   172   no circular references, that is, no question refers to a
       
   173   follow-up question as parent.
       
   174 
       
   175   The function \texttt{ordered\_thread\_sizes} orders the
       
   176   message threads according to how many answers were given for one
       
   177   message (that is how many children, grand-children and so on one
       
   178   message has).
       
   179 
       
   180   The auxiliary function \texttt{search} enumerates all children,
       
   181   grand-children and so on for a given record \texttt{r} (including
       
   182   the record \texttt{r} itself). The function \texttt{search} returns these children
       
   183   as a list of \texttt{Rec}s.
       
   184 
       
   185   The function \texttt{thread\_size} generates for a record, say
       
   186   \texttt{r}, a pair consisting of the number of \texttt{r} and the
       
   187   number of all children as produced by \texttt{search}. The numbers are the
       
   188   integers given for each message---for \texttt{log.csv} a number
       
   189   is between 0 and 679.
       
   190 
       
   191   The function \texttt{ordered\_thread\_sizes} orders the list of
       
   192   pairs according to which thread in the chat is the longest (the
       
   193   longest should be first).
       
   194 
       
   195 \mbox{}\hfill[15\% Marks]
       
   196 \end{itemize}
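
\noindent
The way tasks (3) and (5) fit together can be sketched on a toy example. This is only one possible reading of the task descriptions, not a reference solution. The record type \texttt{R} is a cut-down, hypothetical stand-in for \texttt{Rec}, the data is made up, and the sketch assumes that the record list is ordered so that the record with number \texttt{n} sits at index \texttt{n}.

```scala
// Cut-down, hypothetical version of Rec: only the fields used below.
case class R(num: Int, msg_id: String, reply_id: Option[String],
             parent: Option[Int] = None, children: List[Int] = Nil)

// nums of all records that name e's hash code as their reply_id
def get_children(e: R, rs: List[R]): List[Int] =
  rs.filter(_.reply_id.contains(e.msg_id)).map(_.num)

// num of the record that e replies to, if there is one
def get_parent(e: R, rs: List[R]): Option[Int] =
  e.reply_id.flatMap(id => rs.find(_.msg_id == id)).map(_.num)

// second step: fill in parent and children for every record
def post_process(rs: List[R]): List[R] =
  rs.map(r => r.copy(parent = get_parent(r, rs),
                     children = get_children(r, rs)))

// r together with its children, grand-children and so on;
// assumes rs(n).num == n and no circular references
def search(r: R, rs: List[R]): List[R] =
  r :: r.children.flatMap(c => search(rs(c), rs))

// the size counts r itself, since search includes r
def thread_size(r: R, rs: List[R]): (Int, Int) =
  (r.num, search(r, rs).length)

// largest thread first
def ordered_thread_sizes(rs: List[R]): List[(Int, Int)] =
  rs.map(thread_size(_, rs)).sortBy(p => -p._2)

// Made-up data: "b" and "c" reply to "a", and "d" replies to "c".
val log = post_process(List(
  R(0, "a", None), R(1, "b", Some("a")),
  R(2, "c", Some("a")), R(3, "d", Some("c"))))
val ranking = ordered_thread_sizes(log)
```

Since \texttt{log.csv} is unordered, the index assumption would have to be established first, for example by sorting the records by their number.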
       
   197 
       
   198 \end{document}
       
   203 
       
   204 
       
   205 %%% Local Variables: 
       
   206 %%% mode: latex
       
   207 %%% TeX-master: t
       
   208 %%% End: