% !TEX program = xelatex
% Source: cws/pre_cw02.tex, changeset 348 (b5b6ed38c2f2),
% Christian Urban <christian.urban@kcl.ac.uk>, Mon, 02 Nov 2020 13:10:02 +0000
\documentclass{article}
\usepackage{../style}
\usepackage{disclaimer}
\usepackage{../langs}

\begin{document}

%% should ask to lower case the words.

\section*{Preliminary Part 7 (Scala, 3 Marks)}

\mbox{}\hfill\textit{``What one programmer can do in one month,}\\
\mbox{}\hfill\textit{two programmers can do in two months.''}\smallskip\\
\mbox{}\hfill\textit{ --- Frederick P.~Brooks (author of The Mythical Man-Month)}\bigskip\medskip

\IMPORTANT{You are asked to implement a Scala program for measuring similarity in
  texts. The preliminary part is due on \cwSEVEN{} at 5pm and worth 3\%.}

\noindent
Also note that the running time of each part will be restricted to a
maximum of 30 seconds on my laptop.

\DISCLAIMER{}


\subsection*{Reference Implementation}

Like the C++ part, the Scala part works like this: you
push your files to GitHub and receive (sometimes after a long delay) some
automated feedback. In the end, we will take a snapshot of the submitted files and
apply an automated marking script to them.\medskip

\noindent
In addition, the Scala part comes with reference
implementations in the form of \texttt{jar}-files. This allows you to run
any test cases on your own computer. For example, you can call Scala on
the command line with the option \texttt{-cp docdiff.jar} and then
query any function from the template file. Say you want to find out
what the function \texttt{occurrences} produces: for this you just need
to prefix it with the object name \texttt{CW7a}. If you want to find out what
this function produces for the list \texttt{List("a", "b", "b")},
you would type something like:

\begin{lstlisting}[language={},numbers=none,basicstyle=\ttfamily\small]
$ scala -cp docdiff.jar
  
scala> CW7a.occurrences(List("a", "b", "b"))
...
\end{lstlisting}%$

\subsection*{Hints}

\noindent
\textbf{For the Preliminary Part:} useful operations involving regular
expressions:
\[\texttt{reg.findAllIn(s).toList}\]
\noindent finds all
substrings in \texttt{s} that match the regular expression
\texttt{reg}; useful list operations: \texttt{.distinct}
removes duplicates from a list, \texttt{.count} counts the number of
elements in a list that satisfy some condition, \texttt{.toMap}
converts a list of pairs into a Map, \texttt{.sum} adds up a list of
integers, \texttt{.max} calculates the maximum of a list.\bigskip
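
\noindent
The operations listed above can be tried out directly in the Scala REPL.
The following is only an illustrative sketch: the value names
(\texttt{reg}, \texttt{s}, \texttt{words} and so on) and the sample string
are made up for this example and do not come from the template file.

\begin{lstlisting}[language={},numbers=none,basicstyle=\ttfamily\small]
// hypothetical sample data, chosen only for illustration
val reg = """\w+""".r
val s = "one fish, two fish"

// all substrings of s matching reg, in order of appearance
val words = reg.findAllIn(s).toList     // List(one, fish, two, fish)

val uniq  = words.distinct              // List(one, fish, two)
val fish  = words.count(_ == "fish")    // 2
val pairs = List(("a", 1), ("b", 2)).toMap   // Map(a -> 1, b -> 2)
val total = List(1, 2, 3).sum           // 6
val most  = List(1, 2, 3).max           // 3
\end{lstlisting}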



\newpage
\subsection*{Preliminary Part (3 Marks, file docdiff.scala)}

It seems source code plagiarism---stealing and submitting someone
else's code---is a serious problem at other
universities.\footnote{Surely, King's students, after all their
  instructions and warnings, would never commit such an offence. Yes?}
Detecting such plagiarism is time-consuming and disheartening for
lecturers at those universities. To aid these poor souls, let's
implement in this part a program that determines the similarity
between two documents (be they source code or texts in English). A
document will be represented as a list of strings.


\subsection*{Tasks}

\begin{itemize}
\item[(1)] Implement a function that `cleans' a string by finding all
  (proper) words in this string. For this use the regular expression
  \texttt{\textbackslash{}w+} for recognising words and the library function
  \texttt{findAllIn}. The function should return a document (a list of
  strings).\\
  \mbox{}\hfill [0.5 Marks]

\item[(2)] In order to compute the overlap between two documents, we
  associate each document with a \texttt{Map}. This Map represents the
  strings in a document and how many times these strings occur in the
  document. A simple (though slightly inefficient) method for counting
  the number of string-occurrences in a document is as follows: remove
  all duplicates from the document; for each of these (unique)
  strings, count how many times it occurs in the original document.
  Return a Map associating strings with occurrences. For example

  \begin{center}
  \pcode{occurrences(List("a", "b", "b", "c", "d"))}
  \end{center}

  produces \pcode{Map(a -> 1, b -> 2, c -> 1, d -> 1)} and

  \begin{center}
  \pcode{occurrences(List("d", "b", "d", "b", "d"))}
  \end{center}

  produces \pcode{Map(d -> 3, b -> 2)}.\hfill[1 Mark]

\item[(3)] You can think of the Maps calculated under (2) as memory-efficient
  representations of sparse ``vectors''. In this subtask you need to
  implement the \emph{product} of two such vectors, sometimes also called
  the \emph{dot product} of two vectors.\footnote{\url{https://en.wikipedia.org/wiki/Dot_product}}

  For this dot product, implement a function that takes two documents
  (\texttt{List[String]}) as arguments. The function first calculates
  the (unique) strings occurring in either document. For each string, it
  multiplies the numbers of occurrences in the two documents. If a string does not
  occur in one of the documents, then the product for this string is zero. At the end
  you need to add up all products. For the two documents in (2) the dot
  product is 7, because

  \[
    \underbrace{1 * 0}_{"a"} \;\;+\;\;
    \underbrace{2 * 2}_{"b"} \;\;+\;\;
    \underbrace{1 * 0}_{"c"} \;\;+\;\;
    \underbrace{1 * 3}_{"d"} \qquad = 7
  \]  
  
  \hfill\mbox{[1 Mark]}

\item[(4)] First, implement a function that calculates the overlap
  between two documents, say $d_1$ and $d_2$, according to the formula

  \[
  \texttt{overlap}(d_1, d_2) = \frac{d_1 \cdot d_2}{\max(d_1^2, d_2^2)}  
  \]

  where $d_1^2$ means $d_1 \cdot d_1$ and so on.
  You can expect this function to return a \texttt{Double} between 0 and 1. The
  overlap between the lists in (2) is $0.5384615384615384$.

  Second, implement a function that calculates the similarity of
  two strings, by first extracting the substrings using the clean
  function from (1)
  and then calculating the overlap of the resulting documents.\\
  \mbox{}\hfill\mbox{[0.5 Marks]}
\end{itemize}
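
\noindent
Read together, tasks (1)--(4) could be sketched roughly as follows. This
is only one possible reading of the descriptions above, not the reference
implementation in \texttt{docdiff.jar}: the object name \texttt{Sketch}
and the function names \texttt{prod} and \texttt{similarity} are
assumptions made for this example, and the signatures in the template
file may differ. As a check on the arithmetic for the two documents in
(2): $d_1 \cdot d_1 = 1^2 + 2^2 + 1^2 + 1^2 = 7$ and
$d_2 \cdot d_2 = 3^2 + 2^2 = 13$, so the overlap is
$7 / \max(7, 13) = 7/13 \approx 0.538$, which agrees with the value given
in (4).

\begin{lstlisting}[language={},numbers=none,basicstyle=\ttfamily\small]
// Hypothetical sketch; names other than clean, occurrences
// and overlap are assumptions, not taken from the template.
object Sketch {

  // (1) all words matching \w+ in a string
  def clean(s: String): List[String] =
    """\w+""".r.findAllIn(s).toList

  // (2) remove duplicates, then count each unique string
  def occurrences(xs: List[String]): Map[String, Int] =
    xs.distinct.map(x => (x, xs.count(_ == x))).toMap

  // (3) dot product; a string missing from one document counts as 0
  def prod(lst1: List[String], lst2: List[String]): Int = {
    val occ1 = occurrences(lst1)
    val occ2 = occurrences(lst2)
    (lst1 ++ lst2).distinct
      .map(x => occ1.getOrElse(x, 0) * occ2.getOrElse(x, 0))
      .sum
  }

  // (4) overlap; assumes at least one document is non-empty,
  // otherwise the division yields NaN
  def overlap(lst1: List[String], lst2: List[String]): Double =
    prod(lst1, lst2).toDouble /
      List(prod(lst1, lst1), prod(lst2, lst2)).max

  def similarity(s1: String, s2: String): Double =
    overlap(clean(s1), clean(s2))
}
\end{lstlisting}

\noindent
With these definitions,
\texttt{Sketch.prod(List("a", "b", "b", "c", "d"), List("d", "b", "d", "b", "d"))}
evaluates to 7, in line with the calculation in (3).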


\end{document} 

%%% Local Variables: 
%%% mode: latex
%%% TeX-master: t
%%% End: