\documentclass{article}
\usepackage{../style}
\usepackage{../langs}

\begin{document}

\section*{August Exam (Scala): Chat Log Mining}

This coursework is worth 50\%. It is about mining a log of an online
chat between 85 participants. The log is given as a csv-list in the file
\texttt{log.csv}. The log is an unordered list containing information about
which message was sent, by whom, when and in response to which other
message. Each message also has a number and a unique hash code.\bigskip

\noindent
\textbf{Important:} Make sure the file you submit can be processed
by just calling

\begin{center}
  \texttt{scala <<filename.scala>>}
\end{center}

\noindent
Do not use any mutable data structures in your
submission! They are not needed. This means you cannot use, for
example, \texttt{ListBuffer}s or \texttt{Array}s. Do not use
\texttt{return} in your code! It has a different meaning in Scala
than in Java. Do not use \texttt{var}! This declares a mutable
variable.

\subsection*{Disclaimer}

It should be understood that the work you submit represents your own
effort! You have not copied from anyone or anywhere else. An exception
is the Scala code I showed during the lectures or uploaded to KEATS,
which you can freely use.\bigskip

\subsection*{Background}

\noindent
The fields in the file \texttt{log.csv} are organised
as follows:

\begin{center}
\texttt{counter, id, time\_date, name, country, parent\_id, msg}
\end{center}

\noindent
Each line in this file contains the data for a single message. The field
\texttt{counter} is an integer number given to each message; \texttt{id} is a
unique hash string for a message; \texttt{time\_date} is the time when the message
was sent; \texttt{name} and \texttt{country} are data about the author
of the message, whereby sometimes the authors left the country information
empty; \texttt{parent\_id} is a hash specifying which other message the
message answers (this can also be empty); \texttt{msg} is the actual
message text. \textbf{Be careful} in the tasks below: this text can contain
commas and needs to be treated specially when the line is split up
by using \texttt{line.split(",").toList}. Tasks (2) and (3) are about
processing this data and storing it in the \texttt{Rec}-data-structure, which
is pre-defined in the file \texttt{resit.scala}:

\begin{center}
\begin{verbatim}
Rec(num: Int,
    msg_id: String,
    date: String,
    msg: String,
    author: String,
    country: Option[String],
    reply_id: Option[String],
    parent: Option[Int] = None,
    children: List[Int] = Nil)
\end{verbatim}
\end{center}

\noindent
The transformation into a \texttt{Rec}-data-structure is a two-step process:
first the fields \texttt{parent} and \texttt{children} are given default
values; this information is then filled in during a second step.
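
\noindent
As an illustration only, the first step might be sketched as follows,
using the function names \texttt{get\_csv} and \texttt{process\_line}
from the tasks below. The sketch assumes that \texttt{Rec} is a case
class, that the file is UTF-8 encoded, and that the pieces of a split
message text are rejoined with commas; the helper \texttt{opt} is
hypothetical and not part of \texttt{resit.scala}.

\begin{verbatim}
import scala.io.Source

// Rec as shown above; assumed to be a case class.
case class Rec(num: Int,
               msg_id: String,
               date: String,
               msg: String,
               author: String,
               country: Option[String],
               reply_id: Option[String],
               parent: Option[Int] = None,
               children: List[Int] = Nil)

// Reads a file and returns its lines without the header line.
def get_csv(file: String): List[String] = {
  val src = Source.fromFile(file, "UTF-8")  // encoding is an assumption
  try src.getLines().toList.drop(1)
  finally src.close()
}

// Hypothetical helper: an empty csv-field becomes None.
def opt(s: String): Option[String] =
  if (s.isEmpty) None else Some(s)

// csv order: counter, id, time_date, name, country, parent_id, msg
def process_line(line: String): Rec = {
  val fs = line.split(",").toList
  // the 7th element and everything after it form the message text
  val msg = fs.drop(6).mkString(",")
  // parent and children keep their default values in this first step
  Rec(fs(0).toInt, fs(1), fs(2), msg, fs(3), opt(fs(4)), opt(fs(5)))
}
\end{verbatim}

\noindent
Note that \texttt{split(",")} drops trailing empty strings, so this
sketch only works if every line ends in a non-empty message text.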

The main information that will be computed in the tasks below is
which countries the authors are from and how many authors are from each
country. The last task will also rank which messages have been the most
popular in terms of how many replies they received (this will be computed
according to the number of children, grand-children and so on of a
message).
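
\noindent
To make the country count more concrete, here is a minimal sketch of
one possible reading. It assumes that \texttt{Rec} is a case class,
that an author is identified by the \texttt{author} string, and that
each distinct author is counted once per country. The function names
\texttt{get\_countries} and \texttt{get\_countries\_numbers} are taken
from the tasks below; the parameter and the return type
\texttt{Set[String]} are assumptions.

\begin{verbatim}
// Rec as shown above; assumed to be a case class.
case class Rec(num: Int,
               msg_id: String,
               date: String,
               msg: String,
               author: String,
               country: Option[String],
               reply_id: Option[String],
               parent: Option[Int] = None,
               children: List[Int] = Nil)

// All countries occurring in the log; "" stands for "not specified".
def get_countries(rs: List[Rec]): Set[String] =
  rs.map(_.country.getOrElse("")).toSet

// Number of distinct authors per country (assumed reading of the task).
def get_countries_numbers(rs: List[Rec]): Map[String, Int] =
  rs.map(r => (r.author, r.country.getOrElse("")))
    .distinct                    // count each author once per country
    .groupBy(_._2)               // group the pairs by country
    .map { case (c, as) => (c, as.size) }
\end{verbatim}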

\subsection*{Tasks}

\begin{itemize}
\item[(1)] The function \texttt{get\_csv} takes a file name as
  argument. It should read the corresponding file and return its
  content. The content should be returned as a list of strings, namely a
  string for each line in the file. Since the file is a csv-file, the
  first line (the header) should be dropped in the result. Lines are
  separated by \verb!"\n"!. For the file \texttt{log.csv} there should
  be a list of 680 separate strings.

  \mbox{}\hfill[5\% Marks]

\item[(2)] The function \texttt{process\_line} takes a single line
  from the csv-file (as generated by \texttt{get\_csv}) and creates a
  Rec(ord) data structure. This data structure is pre-defined in the
  Scala file.

  For processing a line, you should use the function

  \begin{center}
    \verb!<<some_line>>.split(",").toList!
  \end{center}

  \noindent
  in order to separate the fields. However, \textbf{be careful}: the message
  text in the last field of \texttt{log.csv} can contain commas, and
  therefore the split will not always result in a list of only 7
  elements. You need to concatenate the 7th element and anything beyond
  it into a single string before assigning the field \texttt{msg}.

  \mbox{}\hfill[10\% Marks]

\item[(3)] Each record in the log contains a unique hash code
  identifying the message. For example

  \begin{center}
    \verb!"5ebeb459ac278d01301f1497"!
  \end{center}

  \noindent
  Some messages also contain a hash code identifying the parent
  message (that is, the question to which they reply). The function
  \texttt{post\_process} fills in the information about potential
  children and a potential parent message.

  The auxiliary function \texttt{get\_children} takes a record
  \texttt{e} and a record list \texttt{rs} as arguments, and returns
  the list of all direct children (children have the hash code of
  \texttt{e} as \texttt{reply\_id}). The list of children is returned
  as a list of \texttt{num}s. The \texttt{num}s can be used later
  as indexes in a \texttt{Rec}-list.

  The auxiliary function \texttt{get\_parent} returns the number of
  the record corresponding to the \texttt{reply\_id} (encoded as
  \texttt{Some} if there exists one, otherwise it returns \texttt{None}).

  In order to update a record, say \texttt{r}, with some additional
  information, you can use the Scala code
\begin{verbatim}
    r.copy(parent = ....,
           children = ....)
\end{verbatim}

  \mbox{}\hfill[10\% Marks]

\item[(4)] The functions \texttt{get\_countries} and
  \texttt{get\_countries\_numbers} calculate which countries the
  message authors come from and how many authors come from each
  country (returned as a \texttt{Map} from countries to Integers). In
  case an author did not specify a country, the empty string should
  be returned.

  \mbox{}\hfill[10\% Marks]

\item[(5)] This task identifies the most popular questions in the log,
  whereby popularity is measured in terms of how many follow-up
  questions were asked. We say that such questions belong to a
  \emph{thread}. It can be assumed that in \texttt{log.csv} there are
  no circular references, that is, no question refers to a
  follow-up question as parent.

  The function \texttt{ordered\_thread\_sizes} orders the
  message threads according to how many answers were given for one
  message (that is, how many children, grand-children and so on one
  message has).

  The auxiliary function \texttt{search} enumerates all children,
  grand-children and so on for a given record \texttt{r} (including
  the record \texttt{r} itself). The function \texttt{search} returns
  these records as a list of \texttt{Rec}s.

  The function \texttt{thread\_size} generates for a record, say
  \texttt{r}, a pair consisting of the number of \texttt{r} and the
  number of all children as produced by \texttt{search}. The numbers are the
  integers given for each message; for \texttt{log.csv} a number
  is between 0 and 679.

  The function \texttt{ordered\_thread\_sizes} orders the list of
  pairs according to which thread in the chat is the longest (the
  longest should be first).

  \mbox{}\hfill[15\% Marks]
\end{itemize}
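
\noindent
For orientation, tasks (3) and (5) could be approached along the
following lines. This is a sketch under stated assumptions, not a model
solution: it assumes that \texttt{Rec} is a case class, that the record
list is sorted by \texttt{num} so that \texttt{rs(n)} is the record
with number \texttt{n}, and that the size of a thread counts the record
itself (because \texttt{search} includes \texttt{r}). The parameter
lists and return types are assumptions as well.

\begin{verbatim}
// Rec as shown in the Background section; assumed to be a case class.
case class Rec(num: Int,
               msg_id: String,
               date: String,
               msg: String,
               author: String,
               country: Option[String],
               reply_id: Option[String],
               parent: Option[Int] = None,
               children: List[Int] = Nil)

// Direct children of e: records whose reply_id is the hash code of e.
def get_children(e: Rec, rs: List[Rec]): List[Int] =
  rs.filter(_.reply_id.contains(e.msg_id)).map(_.num)

// Number of the record that e replies to, if there is one.
def get_parent(e: Rec, rs: List[Rec]): Option[Int] =
  e.reply_id.flatMap(id => rs.find(_.msg_id == id)).map(_.num)

// Second step: fill in parent and children for every record.
def post_process(rs: List[Rec]): List[Rec] =
  rs.map(r => r.copy(parent = get_parent(r, rs),
                     children = get_children(r, rs)))

// r together with its children, grand-children and so on.
// Assumes rs(n).num == n; terminates because there are no cycles.
def search(r: Rec, rs: List[Rec]): List[Rec] =
  r :: r.children.flatMap(c => search(rs(c), rs))

// Pair of the number of r and the size of its thread (r included).
def thread_size(r: Rec, rs: List[Rec]): (Int, Int) =
  (r.num, search(r, rs).size)

// Largest threads first.
def ordered_thread_sizes(rs: List[Rec]): List[(Int, Int)] =
  rs.map(r => thread_size(r, rs)).sortBy(p => -p._2)
\end{verbatim}

\noindent
Since the log is unordered, the records would need to be sorted first,
for example with \texttt{rs.sortBy(\_.num)}, before the \texttt{num}s
are used as indexes.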

\end{document}

%%% Local Variables:
%%% mode: latex
%%% TeX-master: t
%%% End: