cws/core_cw02.tex
author Christian Urban <christian.urban@kcl.ac.uk>
Mon, 08 Nov 2021 01:52:56 +0000
changeset 406 ad24f50c484d
parent 396 3ffe978a5664
child 411 f55742af79ef
permissions -rw-r--r--
updated

% !TEX program = xelatex
\documentclass{article}
\usepackage{../style}
\usepackage{disclaimer}
\usepackage{../langs}

\begin{document}


%% should ask to lower case the words.

\section*{Core Part 2 (Scala, 3 Marks)}

\mbox{}\hfill\textit{``What one programmer can do in one month,}\\
\mbox{}\hfill\textit{two programmers can do in two months.''}\smallskip\\
\mbox{}\hfill\textit{ --- Frederick P.~Brooks (author of The Mythical Man-Month)}\bigskip\medskip

\IMPORTANT{You are asked to implement a Scala program for measuring similarity in
  texts. The preliminary part is due on \cwSEVEN{} at 5pm and worth 3\%.
  Any 1\% you achieve in the preliminary part counts as your ``weekly engagement''.}

\noindent
Also note that the running time of each part will be restricted to a
maximum of 30 seconds on my laptop.

\DISCLAIMER{}


\subsection*{Reference Implementation}

Like the C++ part, the Scala part works like this: you
push your files to GitHub and receive (after sometimes a long delay) some
automated feedback. In the end we will take a snapshot of the submitted files and
apply an automated marking script to them.\medskip

\noindent
In addition, the Scala part comes with reference
implementations in form of \texttt{jar}-files. This allows you to run
any test cases on your own computer. For example you can call Scala on
the command line with the option \texttt{-cp docdiff.jar} and then
query any function from the template file. Say you want to find out
what the function \texttt{occurrences} produces: for this you just need
to prefix it with the object name \texttt{C2}.  If you want to find out what
these functions produce for the list \texttt{List("a", "b", "b")},
you would type something like:

\begin{lstlisting}[language={},numbers=none,basicstyle=\ttfamily\small]
$ scala -cp docdiff.jar
  
scala> C2.occurrences(List("a", "b", "b"))
...
\end{lstlisting}%$

\subsection*{Hints}

\noindent
\textbf{For the Core Part 2:} useful operations involving regular
expressions:
\[\texttt{reg.findAllIn(s).toList}\]
\noindent finds all
substrings in \texttt{s} according to a regular regular expression
\texttt{reg}; useful list operations: \texttt{.distinct}
removing duplicates from a list, \texttt{.count} counts the number of
elements in a list that satisfy some condition, \texttt{.toMap}
transfers a list of pairs into a Map, \texttt{.sum} adds up a list of
integers, \texttt{.max} calculates the maximum of a list.\bigskip



\newpage
\subsection*{Core Part 2 (3 Marks, file docdiff.scala)}

It seems plagiarism---stealing and submitting someone
else's code---is a serious problem at other
universities.\footnote{Surely, King's students, after all their
  instructions and warnings, would never commit such an offence. Yes?}
Detecting such plagiarism is time-consuming and disheartening for
lecturers at those universities. To aid these poor souls, let's
implement in this part a program that determines the similarity
between two documents (be they source code or texts in English). A
document will be represented as a list of strings.


\subsection*{Tasks}

\begin{itemize}
\item[(1)] Implement a function that `cleans' a string by finding all
  (proper) words in the string. For this use the regular expression
  \texttt{\textbackslash{}w+} for recognising words and the library function
  \texttt{findAllIn}. The function should return a document (a list of
  strings).
  \mbox{}\hfill\mbox{[0.5 Marks]}

\item[(2)] In order to compute the overlap between two documents, we
  associate each document with a \texttt{Map}. This Map represents the
  strings in a document and how many times these strings occur in the
  document. A simple (though slightly inefficient) method for counting
  the number of string-occurrences in a document is as follows: remove
  all duplicates from the document; for each of these (unique)
  strings, count how many times they occur in the original document.
  Return a Map associating strings with occurrences. For example

  \begin{center}
  \pcode{occurrences(List("a", "b", "b", "c", "d"))}
  \end{center}

  produces \pcode{Map(a -> 1, b -> 2, c -> 1, d -> 1)} and

  \begin{center}
  \pcode{occurrences(List("d", "b", "d", "b", "d"))}
  \end{center}

  produces \pcode{Map(d -> 3, b -> 2)}.\hfill[1 Mark]

\item[(3)] You can think of the Maps calculated under (2) as memory-efficient
  representations of sparse ``vectors''. In this subtask you need to
  implement the \emph{product} of two such vectors, sometimes also called
  \emph{dot product} of two vectors.\footnote{\url{https://en.wikipedia.org/wiki/Dot_product}}

  For this dot product, implement a function that takes two documents
  (\texttt{List[String]}) as arguments. The function first calculates
  the (unique) strings in both. For each string, it multiplies the
  corresponding occurrences in each document. If a string does not
  occur in one of the documents, then the product for this string is zero. At the end
  you need to add up all products. For the two documents in (2) the dot
  product is 7, because

  \[
    \underbrace{1 * 0}_{"a"} \;\;+\;\;
    \underbrace{2 * 2}_{"b"} \;\;+\;\;
    \underbrace{1 * 0}_{"c"} \;\;+\;\;
    \underbrace{1 * 3}_{"d"} \qquad = 7
  \]  
  
  \hfill\mbox{[1 Mark]}

\item[(4)] Implement first a function that calculates the overlap
  between two documents, say $d_1$ and $d_2$, according to the formula

  \[
  \texttt{overlap}(d_1, d_2) = \frac{d_1 \cdot d_2}{max(d_1^2, d_2^2)}  
  \]

  where $d_1^2$ means $d_1 \cdot d_1$ and so on.
  You can expect this function to return a \texttt{Double} between 0 and 1. The
  overlap between the lists in (2) is $0.5384615384615384$.

  Second, implement a function that calculates the similarity of
  two strings, by first extracting the substrings using the clean
  function from (1)
  and then calculating the overlap of the resulting documents.\\
  \mbox{}\hfill\mbox{[0.5 Marks]}
\end{itemize}


\end{document} 

%%% Local Variables: 
%%% mode: latex
%%% TeX-master: t
%%% End: