cws/core_cw02.tex
author Christian Urban <christian.urban@kcl.ac.uk>
Fri, 05 Nov 2021 16:47:55 +0000
changeset 396 3ffe978a5664
parent 356 cws/pre_cw02.tex@d1046d9d3213
child 411 f55742af79ef
permissions -rw-r--r--
updated
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
268
e43f7e92ba26 updated
Christian Urban <urbanc@in.tum.de>
parents: 264
diff changeset
     1
% !TEX program = xelatex
6
aae256985251 updated
Christian Urban <urbanc@in.tum.de>
parents:
diff changeset
     2
\documentclass{article}
39
c6fe374a5fca updated
Christian Urban <urbanc@in.tum.de>
parents: 38
diff changeset
     3
\usepackage{../style}
166
780c40aaad27 updated
Christian Urban <urbanc@in.tum.de>
parents: 163
diff changeset
     4
\usepackage{disclaimer}
202
f7bcb27d1940 updated
Christian Urban <urbanc@in.tum.de>
parents: 166
diff changeset
     5
\usepackage{../langs}
6
aae256985251 updated
Christian Urban <urbanc@in.tum.de>
parents:
diff changeset
     6
aae256985251 updated
Christian Urban <urbanc@in.tum.de>
parents:
diff changeset
     7
\begin{document}
aae256985251 updated
Christian Urban <urbanc@in.tum.de>
parents:
diff changeset
     8
aae256985251 updated
Christian Urban <urbanc@in.tum.de>
parents:
diff changeset
     9
329
8a34b2ebc8cc updated
Christian Urban <urbanc@in.tum.de>
parents: 316
diff changeset
    10
%% should ask to lower case the words.
8a34b2ebc8cc updated
Christian Urban <urbanc@in.tum.de>
parents: 316
diff changeset
    11
396
3ffe978a5664 updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 356
diff changeset
    12
\section*{Core Part 2 (Scala, 3 Marks)}
6
aae256985251 updated
Christian Urban <urbanc@in.tum.de>
parents:
diff changeset
    13
264
ecd989eee8bd updated
Christian Urban <urbanc@in.tum.de>
parents: 259
diff changeset
    14
\mbox{}\hfill\textit{``What one programmer can do in one month,}\\
ecd989eee8bd updated
Christian Urban <urbanc@in.tum.de>
parents: 259
diff changeset
    15
\mbox{}\hfill\textit{two programmers can do in two months.''}\smallskip\\
276
52faee6d0be2 updated
Christian Urban <urbanc@in.tum.de>
parents: 268
diff changeset
    16
\mbox{}\hfill\textit{ --- Frederick P.~Brooks (author of The Mythical Man-Month)}\bigskip\medskip
264
ecd989eee8bd updated
Christian Urban <urbanc@in.tum.de>
parents: 259
diff changeset
    17
346
663c2a9108d1 updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 333
diff changeset
    18
\IMPORTANT{You are asked to implement a Scala program for measuring similarity in
355
bc3980949af2 updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 346
diff changeset
    19
  texts. The preliminary part is due on \cwSEVEN{} at 5pm and worth 3\%.
bc3980949af2 updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 346
diff changeset
    20
  Any 1\% you achieve in the preliminary part counts as your ``weekly engagement''.}
202
f7bcb27d1940 updated
Christian Urban <urbanc@in.tum.de>
parents: 166
diff changeset
    21
f7bcb27d1940 updated
Christian Urban <urbanc@in.tum.de>
parents: 166
diff changeset
    22
\noindent
144
716042628398 updated
Christian Urban <urbanc@in.tum.de>
parents: 110
diff changeset
    23
Also note that the running time of each part will be restricted to a
202
f7bcb27d1940 updated
Christian Urban <urbanc@in.tum.de>
parents: 166
diff changeset
    24
maximum of 30 seconds on my laptop.
39
c6fe374a5fca updated
Christian Urban <urbanc@in.tum.de>
parents: 38
diff changeset
    25
166
780c40aaad27 updated
Christian Urban <urbanc@in.tum.de>
parents: 163
diff changeset
    26
\DISCLAIMER{}
39
c6fe374a5fca updated
Christian Urban <urbanc@in.tum.de>
parents: 38
diff changeset
    27
c6fe374a5fca updated
Christian Urban <urbanc@in.tum.de>
parents: 38
diff changeset
    28
202
f7bcb27d1940 updated
Christian Urban <urbanc@in.tum.de>
parents: 166
diff changeset
    29
\subsection*{Reference Implementation}
45
8399976b77fe updated
Christian Urban <urbanc@in.tum.de>
parents: 42
diff changeset
    30
306
Christian Urban <urbanc@in.tum.de>
parents: 301
diff changeset
    31
Like the C++ part, the Scala part works like this: you
202
f7bcb27d1940 updated
Christian Urban <urbanc@in.tum.de>
parents: 166
diff changeset
    32
push your files to GitHub and receive (after sometimes a long delay) some
306
Christian Urban <urbanc@in.tum.de>
parents: 301
diff changeset
    33
automated feedback. In the end we will take a snapshot of the submitted files and
268
e43f7e92ba26 updated
Christian Urban <urbanc@in.tum.de>
parents: 264
diff changeset
    34
apply an automated marking script to them.\medskip
45
8399976b77fe updated
Christian Urban <urbanc@in.tum.de>
parents: 42
diff changeset
    35
268
e43f7e92ba26 updated
Christian Urban <urbanc@in.tum.de>
parents: 264
diff changeset
    36
\noindent
306
Christian Urban <urbanc@in.tum.de>
parents: 301
diff changeset
    37
In addition, the Scala part comes with reference
Christian Urban <urbanc@in.tum.de>
parents: 301
diff changeset
    38
implementations in form of \texttt{jar}-files. This allows you to run
202
f7bcb27d1940 updated
Christian Urban <urbanc@in.tum.de>
parents: 166
diff changeset
    39
any test cases on your own computer. For example you can call Scala on
f7bcb27d1940 updated
Christian Urban <urbanc@in.tum.de>
parents: 166
diff changeset
    40
the command line with the option \texttt{-cp docdiff.jar} and then
f7bcb27d1940 updated
Christian Urban <urbanc@in.tum.de>
parents: 166
diff changeset
    41
query any function from the template file. Say you want to find out
203
eb188f9ac038 updated
Christian Urban <urbanc@in.tum.de>
parents: 202
diff changeset
    42
what the function \texttt{occurrences} produces: for this you just need
396
3ffe978a5664 updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 356
diff changeset
    43
to prefix it with the object name \texttt{C2}.  If you want to find out what
202
f7bcb27d1940 updated
Christian Urban <urbanc@in.tum.de>
parents: 166
diff changeset
    44
these functions produce for the list \texttt{List("a", "b", "b")},
f7bcb27d1940 updated
Christian Urban <urbanc@in.tum.de>
parents: 166
diff changeset
    45
you would type something like:
6
aae256985251 updated
Christian Urban <urbanc@in.tum.de>
parents:
diff changeset
    46
202
f7bcb27d1940 updated
Christian Urban <urbanc@in.tum.de>
parents: 166
diff changeset
    47
\begin{lstlisting}[language={},numbers=none,basicstyle=\ttfamily\small]
f7bcb27d1940 updated
Christian Urban <urbanc@in.tum.de>
parents: 166
diff changeset
    48
$ scala -cp docdiff.jar
f7bcb27d1940 updated
Christian Urban <urbanc@in.tum.de>
parents: 166
diff changeset
    49
  
396
3ffe978a5664 updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 356
diff changeset
    50
scala> C2.occurrences(List("a", "b", "b"))
202
f7bcb27d1940 updated
Christian Urban <urbanc@in.tum.de>
parents: 166
diff changeset
    51
...
f7bcb27d1940 updated
Christian Urban <urbanc@in.tum.de>
parents: 166
diff changeset
    52
\end{lstlisting}%$
f7bcb27d1940 updated
Christian Urban <urbanc@in.tum.de>
parents: 166
diff changeset
    53
f7bcb27d1940 updated
Christian Urban <urbanc@in.tum.de>
parents: 166
diff changeset
    54
\subsection*{Hints}
6
aae256985251 updated
Christian Urban <urbanc@in.tum.de>
parents:
diff changeset
    55
203
eb188f9ac038 updated
Christian Urban <urbanc@in.tum.de>
parents: 202
diff changeset
    56
\noindent
396
3ffe978a5664 updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 356
diff changeset
    57
\textbf{For the Core Part 2:} useful operations involving regular
203
eb188f9ac038 updated
Christian Urban <urbanc@in.tum.de>
parents: 202
diff changeset
    58
expressions:
eb188f9ac038 updated
Christian Urban <urbanc@in.tum.de>
parents: 202
diff changeset
    59
\[\texttt{reg.findAllIn(s).toList}\]
eb188f9ac038 updated
Christian Urban <urbanc@in.tum.de>
parents: 202
diff changeset
    60
\noindent finds all
eb188f9ac038 updated
Christian Urban <urbanc@in.tum.de>
parents: 202
diff changeset
    61
substrings in \texttt{s} according to a regular regular expression
eb188f9ac038 updated
Christian Urban <urbanc@in.tum.de>
parents: 202
diff changeset
    62
\texttt{reg}; useful list operations: \texttt{.distinct}
eb188f9ac038 updated
Christian Urban <urbanc@in.tum.de>
parents: 202
diff changeset
    63
removing duplicates from a list, \texttt{.count} counts the number of
eb188f9ac038 updated
Christian Urban <urbanc@in.tum.de>
parents: 202
diff changeset
    64
elements in a list that satisfy some condition, \texttt{.toMap}
eb188f9ac038 updated
Christian Urban <urbanc@in.tum.de>
parents: 202
diff changeset
    65
transfers a list of pairs into a Map, \texttt{.sum} adds up a list of
eb188f9ac038 updated
Christian Urban <urbanc@in.tum.de>
parents: 202
diff changeset
    66
integers, \texttt{.max} calculates the maximum of a list.\bigskip
45
8399976b77fe updated
Christian Urban <urbanc@in.tum.de>
parents: 42
diff changeset
    67
39
c6fe374a5fca updated
Christian Urban <urbanc@in.tum.de>
parents: 38
diff changeset
    68
c6fe374a5fca updated
Christian Urban <urbanc@in.tum.de>
parents: 38
diff changeset
    69
202
f7bcb27d1940 updated
Christian Urban <urbanc@in.tum.de>
parents: 166
diff changeset
    70
\newpage
396
3ffe978a5664 updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 356
diff changeset
    71
\subsection*{Core Part 2 (3 Marks, file docdiff.scala)}
6
aae256985251 updated
Christian Urban <urbanc@in.tum.de>
parents:
diff changeset
    72
396
3ffe978a5664 updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 356
diff changeset
    73
It seems plagiarism---stealing and submitting someone
203
eb188f9ac038 updated
Christian Urban <urbanc@in.tum.de>
parents: 202
diff changeset
    74
else's code---is a serious problem at other
eb188f9ac038 updated
Christian Urban <urbanc@in.tum.de>
parents: 202
diff changeset
    75
universities.\footnote{Surely, King's students, after all their
eb188f9ac038 updated
Christian Urban <urbanc@in.tum.de>
parents: 202
diff changeset
    76
  instructions and warnings, would never commit such an offence. Yes?}
eb188f9ac038 updated
Christian Urban <urbanc@in.tum.de>
parents: 202
diff changeset
    77
Detecting such plagiarism is time-consuming and disheartening for
eb188f9ac038 updated
Christian Urban <urbanc@in.tum.de>
parents: 202
diff changeset
    78
lecturers at those universities. To aid these poor souls, let's
eb188f9ac038 updated
Christian Urban <urbanc@in.tum.de>
parents: 202
diff changeset
    79
implement in this part a program that determines the similarity
eb188f9ac038 updated
Christian Urban <urbanc@in.tum.de>
parents: 202
diff changeset
    80
between two documents (be they source code or texts in English). A
eb188f9ac038 updated
Christian Urban <urbanc@in.tum.de>
parents: 202
diff changeset
    81
document will be represented as a list of strings.
6
aae256985251 updated
Christian Urban <urbanc@in.tum.de>
parents:
diff changeset
    82
aae256985251 updated
Christian Urban <urbanc@in.tum.de>
parents:
diff changeset
    83
202
f7bcb27d1940 updated
Christian Urban <urbanc@in.tum.de>
parents: 166
diff changeset
    84
\subsection*{Tasks}
45
8399976b77fe updated
Christian Urban <urbanc@in.tum.de>
parents: 42
diff changeset
    85
8399976b77fe updated
Christian Urban <urbanc@in.tum.de>
parents: 42
diff changeset
    86
\begin{itemize}
203
eb188f9ac038 updated
Christian Urban <urbanc@in.tum.de>
parents: 202
diff changeset
    87
\item[(1)] Implement a function that `cleans' a string by finding all
396
3ffe978a5664 updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 356
diff changeset
    88
  (proper) words in the string. For this use the regular expression
276
52faee6d0be2 updated
Christian Urban <urbanc@in.tum.de>
parents: 268
diff changeset
    89
  \texttt{\textbackslash{}w+} for recognising words and the library function
52faee6d0be2 updated
Christian Urban <urbanc@in.tum.de>
parents: 268
diff changeset
    90
  \texttt{findAllIn}. The function should return a document (a list of
396
3ffe978a5664 updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 356
diff changeset
    91
  strings).
3ffe978a5664 updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 356
diff changeset
    92
  \mbox{}\hfill\mbox{[0.5 Marks]}
45
8399976b77fe updated
Christian Urban <urbanc@in.tum.de>
parents: 42
diff changeset
    93
203
eb188f9ac038 updated
Christian Urban <urbanc@in.tum.de>
parents: 202
diff changeset
    94
\item[(2)] In order to compute the overlap between two documents, we
202
f7bcb27d1940 updated
Christian Urban <urbanc@in.tum.de>
parents: 166
diff changeset
    95
  associate each document with a \texttt{Map}. This Map represents the
203
eb188f9ac038 updated
Christian Urban <urbanc@in.tum.de>
parents: 202
diff changeset
    96
  strings in a document and how many times these strings occur in the
202
f7bcb27d1940 updated
Christian Urban <urbanc@in.tum.de>
parents: 166
diff changeset
    97
  document. A simple (though slightly inefficient) method for counting
203
eb188f9ac038 updated
Christian Urban <urbanc@in.tum.de>
parents: 202
diff changeset
    98
  the number of string-occurrences in a document is as follows: remove
202
f7bcb27d1940 updated
Christian Urban <urbanc@in.tum.de>
parents: 166
diff changeset
    99
  all duplicates from the document; for each of these (unique)
f7bcb27d1940 updated
Christian Urban <urbanc@in.tum.de>
parents: 166
diff changeset
   100
  strings, count how many times they occur in the original document.
203
eb188f9ac038 updated
Christian Urban <urbanc@in.tum.de>
parents: 202
diff changeset
   101
  Return a Map associating strings with occurrences. For example
6
aae256985251 updated
Christian Urban <urbanc@in.tum.de>
parents:
diff changeset
   102
45
8399976b77fe updated
Christian Urban <urbanc@in.tum.de>
parents: 42
diff changeset
   103
  \begin{center}
203
eb188f9ac038 updated
Christian Urban <urbanc@in.tum.de>
parents: 202
diff changeset
   104
  \pcode{occurrences(List("a", "b", "b", "c", "d"))}
45
8399976b77fe updated
Christian Urban <urbanc@in.tum.de>
parents: 42
diff changeset
   105
  \end{center}
8399976b77fe updated
Christian Urban <urbanc@in.tum.de>
parents: 42
diff changeset
   106
202
f7bcb27d1940 updated
Christian Urban <urbanc@in.tum.de>
parents: 166
diff changeset
   107
  produces \pcode{Map(a -> 1, b -> 2, c -> 1, d -> 1)} and
45
8399976b77fe updated
Christian Urban <urbanc@in.tum.de>
parents: 42
diff changeset
   108
8399976b77fe updated
Christian Urban <urbanc@in.tum.de>
parents: 42
diff changeset
   109
  \begin{center}
203
eb188f9ac038 updated
Christian Urban <urbanc@in.tum.de>
parents: 202
diff changeset
   110
  \pcode{occurrences(List("d", "b", "d", "b", "d"))}
48
7a6a75ea9738 updated
Christian Urban <urbanc@in.tum.de>
parents: 46
diff changeset
   111
  \end{center}
45
8399976b77fe updated
Christian Urban <urbanc@in.tum.de>
parents: 42
diff changeset
   112
202
f7bcb27d1940 updated
Christian Urban <urbanc@in.tum.de>
parents: 166
diff changeset
   113
  produces \pcode{Map(d -> 3, b -> 2)}.\hfill[1 Mark]
6
aae256985251 updated
Christian Urban <urbanc@in.tum.de>
parents:
diff changeset
   114
203
eb188f9ac038 updated
Christian Urban <urbanc@in.tum.de>
parents: 202
diff changeset
   115
\item[(3)] You can think of the Maps calculated under (2) as memory-efficient
202
f7bcb27d1940 updated
Christian Urban <urbanc@in.tum.de>
parents: 166
diff changeset
   116
  representations of sparse ``vectors''. In this subtask you need to
203
eb188f9ac038 updated
Christian Urban <urbanc@in.tum.de>
parents: 202
diff changeset
   117
  implement the \emph{product} of two such vectors, sometimes also called
eb188f9ac038 updated
Christian Urban <urbanc@in.tum.de>
parents: 202
diff changeset
   118
  \emph{dot product} of two vectors.\footnote{\url{https://en.wikipedia.org/wiki/Dot_product}}
148
ead6089209ba updated
Christian Urban <urbanc@in.tum.de>
parents: 147
diff changeset
   119
203
eb188f9ac038 updated
Christian Urban <urbanc@in.tum.de>
parents: 202
diff changeset
   120
  For this dot product, implement a function that takes two documents
202
f7bcb27d1940 updated
Christian Urban <urbanc@in.tum.de>
parents: 166
diff changeset
   121
  (\texttt{List[String]}) as arguments. The function first calculates
f7bcb27d1940 updated
Christian Urban <urbanc@in.tum.de>
parents: 166
diff changeset
   122
  the (unique) strings in both. For each string, it multiplies the
203
eb188f9ac038 updated
Christian Urban <urbanc@in.tum.de>
parents: 202
diff changeset
   123
  corresponding occurrences in each document. If a string does not
eb188f9ac038 updated
Christian Urban <urbanc@in.tum.de>
parents: 202
diff changeset
   124
  occur in one of the documents, then the product for this string is zero. At the end
eb188f9ac038 updated
Christian Urban <urbanc@in.tum.de>
parents: 202
diff changeset
   125
  you need to add up all products. For the two documents in (2) the dot
eb188f9ac038 updated
Christian Urban <urbanc@in.tum.de>
parents: 202
diff changeset
   126
  product is 7, because
45
8399976b77fe updated
Christian Urban <urbanc@in.tum.de>
parents: 42
diff changeset
   127
8399976b77fe updated
Christian Urban <urbanc@in.tum.de>
parents: 42
diff changeset
   128
  \[
202
f7bcb27d1940 updated
Christian Urban <urbanc@in.tum.de>
parents: 166
diff changeset
   129
    \underbrace{1 * 0}_{"a"} \;\;+\;\;
f7bcb27d1940 updated
Christian Urban <urbanc@in.tum.de>
parents: 166
diff changeset
   130
    \underbrace{2 * 2}_{"b"} \;\;+\;\;
f7bcb27d1940 updated
Christian Urban <urbanc@in.tum.de>
parents: 166
diff changeset
   131
    \underbrace{1 * 0}_{"c"} \;\;+\;\;
203
eb188f9ac038 updated
Christian Urban <urbanc@in.tum.de>
parents: 202
diff changeset
   132
    \underbrace{1 * 3}_{"d"} \qquad = 7
202
f7bcb27d1940 updated
Christian Urban <urbanc@in.tum.de>
parents: 166
diff changeset
   133
  \]  
f7bcb27d1940 updated
Christian Urban <urbanc@in.tum.de>
parents: 166
diff changeset
   134
  
f7bcb27d1940 updated
Christian Urban <urbanc@in.tum.de>
parents: 166
diff changeset
   135
  \hfill\mbox{[1 Mark]}
f7bcb27d1940 updated
Christian Urban <urbanc@in.tum.de>
parents: 166
diff changeset
   136
f7bcb27d1940 updated
Christian Urban <urbanc@in.tum.de>
parents: 166
diff changeset
   137
\item[(4)] Implement first a function that calculates the overlap
f7bcb27d1940 updated
Christian Urban <urbanc@in.tum.de>
parents: 166
diff changeset
   138
  between two documents, say $d_1$ and $d_2$, according to the formula
f7bcb27d1940 updated
Christian Urban <urbanc@in.tum.de>
parents: 166
diff changeset
   139
f7bcb27d1940 updated
Christian Urban <urbanc@in.tum.de>
parents: 166
diff changeset
   140
  \[
f7bcb27d1940 updated
Christian Urban <urbanc@in.tum.de>
parents: 166
diff changeset
   141
  \texttt{overlap}(d_1, d_2) = \frac{d_1 \cdot d_2}{max(d_1^2, d_2^2)}  
45
8399976b77fe updated
Christian Urban <urbanc@in.tum.de>
parents: 42
diff changeset
   142
  \]
8399976b77fe updated
Christian Urban <urbanc@in.tum.de>
parents: 42
diff changeset
   143
316
8b57dd326a91 updated
Christian Urban <urbanc@in.tum.de>
parents: 306
diff changeset
   144
  where $d_1^2$ means $d_1 \cdot d_1$ and so on.
203
eb188f9ac038 updated
Christian Urban <urbanc@in.tum.de>
parents: 202
diff changeset
   145
  You can expect this function to return a \texttt{Double} between 0 and 1. The
202
f7bcb27d1940 updated
Christian Urban <urbanc@in.tum.de>
parents: 166
diff changeset
   146
  overlap between the lists in (2) is $0.5384615384615384$.
f7bcb27d1940 updated
Christian Urban <urbanc@in.tum.de>
parents: 166
diff changeset
   147
203
eb188f9ac038 updated
Christian Urban <urbanc@in.tum.de>
parents: 202
diff changeset
   148
  Second, implement a function that calculates the similarity of
eb188f9ac038 updated
Christian Urban <urbanc@in.tum.de>
parents: 202
diff changeset
   149
  two strings, by first extracting the substrings using the clean
eb188f9ac038 updated
Christian Urban <urbanc@in.tum.de>
parents: 202
diff changeset
   150
  function from (1)
eb188f9ac038 updated
Christian Urban <urbanc@in.tum.de>
parents: 202
diff changeset
   151
  and then calculating the overlap of the resulting documents.\\
346
663c2a9108d1 updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 333
diff changeset
   152
  \mbox{}\hfill\mbox{[0.5 Marks]}
663c2a9108d1 updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 333
diff changeset
   153
\end{itemize}
203
eb188f9ac038 updated
Christian Urban <urbanc@in.tum.de>
parents: 202
diff changeset
   154
6
aae256985251 updated
Christian Urban <urbanc@in.tum.de>
parents:
diff changeset
   155
268
e43f7e92ba26 updated
Christian Urban <urbanc@in.tum.de>
parents: 264
diff changeset
   156
\end{document} 
6
aae256985251 updated
Christian Urban <urbanc@in.tum.de>
parents:
diff changeset
   157
aae256985251 updated
Christian Urban <urbanc@in.tum.de>
parents:
diff changeset
   158
%%% Local Variables: 
aae256985251 updated
Christian Urban <urbanc@in.tum.de>
parents:
diff changeset
   159
%%% mode: latex
aae256985251 updated
Christian Urban <urbanc@in.tum.de>
parents:
diff changeset
   160
%%% TeX-master: t
aae256985251 updated
Christian Urban <urbanc@in.tum.de>
parents:
diff changeset
   161
%%% End: