cws/core_cw02.tex
changeset 396 3ffe978a5664
parent 356 d1046d9d3213
child 411 f55742af79ef
equal deleted inserted replaced
395:017f621f5835 396:3ffe978a5664
       
     1 % !TEX program = xelatex
       
     2 \documentclass{article}
       
     3 \usepackage{../style}
       
     4 \usepackage{disclaimer}
       
     5 \usepackage{../langs}
       
     6 
       
     7 \begin{document}
       
     8 
       
     9 
       
    10 %% should ask to lower case the words.
       
    11 
       
    12 \section*{Core Part 2 (Scala, 3 Marks)}
       
    13 
       
    14 \mbox{}\hfill\textit{``What one programmer can do in one month,}\\
       
    15 \mbox{}\hfill\textit{two programmers can do in two months.''}\smallskip\\
       
    16 \mbox{}\hfill\textit{ --- Frederick P.~Brooks (author of The Mythical Man-Month)}\bigskip\medskip
       
    17 
       
    18 \IMPORTANT{You are asked to implement a Scala program for measuring similarity in
       
    19   texts. The preliminary part is due on \cwSEVEN{} at 5pm and worth 3\%.
       
    20   Any 1\% you achieve in the preliminary part counts as your ``weekly engagement''.}
       
    21 
       
    22 \noindent
       
    23 Also note that the running time of each part will be restricted to a
       
    24 maximum of 30 seconds on my laptop.
       
    25 
       
    26 \DISCLAIMER{}
       
    27 
       
    28 
       
    29 \subsection*{Reference Implementation}
       
    30 
       
    31 Like the C++ part, the Scala part works like this: you
       
    32 push your files to GitHub and receive (after sometimes a long delay) some
       
    33 automated feedback. In the end we will take a snapshot of the submitted files and
       
    34 apply an automated marking script to them.\medskip
       
    35 
       
    36 \noindent
       
    37 In addition, the Scala part comes with reference
       
    38 implementations in form of \texttt{jar}-files. This allows you to run
       
    39 any test cases on your own computer. For example you can call Scala on
       
    40 the command line with the option \texttt{-cp docdiff.jar} and then
       
    41 query any function from the template file. Say you want to find out
       
    42 what the function \texttt{occurrences} produces: for this you just need
       
    43 to prefix it with the object name \texttt{C2}.  If you want to find out what
       
    44 these functions produce for the list \texttt{List("a", "b", "b")},
       
    45 you would type something like:
       
    46 
       
    47 \begin{lstlisting}[language={},numbers=none,basicstyle=\ttfamily\small]
       
    48 $ scala -cp docdiff.jar
       
    49   
       
    50 scala> C2.occurrences(List("a", "b", "b"))
       
    51 ...
       
    52 \end{lstlisting}%$
       
    53 
       
    54 \subsection*{Hints}
       
    55 
       
    56 \noindent
       
    57 \textbf{For the Core Part 2:} useful operations involving regular
       
    58 expressions:
       
    59 \[\texttt{reg.findAllIn(s).toList}\]
       
    60 \noindent finds all
       
    61 substrings in \texttt{s} according to a regular regular expression
       
    62 \texttt{reg}; useful list operations: \texttt{.distinct}
       
    63 removing duplicates from a list, \texttt{.count} counts the number of
       
    64 elements in a list that satisfy some condition, \texttt{.toMap}
       
    65 transfers a list of pairs into a Map, \texttt{.sum} adds up a list of
       
    66 integers, \texttt{.max} calculates the maximum of a list.\bigskip
       
    67 
       
    68 
       
    69 
       
    70 \newpage
       
    71 \subsection*{Core Part 2 (3 Marks, file docdiff.scala)}
       
    72 
       
    73 It seems plagiarism---stealing and submitting someone
       
    74 else's code---is a serious problem at other
       
    75 universities.\footnote{Surely, King's students, after all their
       
    76   instructions and warnings, would never commit such an offence. Yes?}
       
    77 Detecting such plagiarism is time-consuming and disheartening for
       
    78 lecturers at those universities. To aid these poor souls, let's
       
    79 implement in this part a program that determines the similarity
       
    80 between two documents (be they source code or texts in English). A
       
    81 document will be represented as a list of strings.
       
    82 
       
    83 
       
    84 \subsection*{Tasks}
       
    85 
       
    86 \begin{itemize}
       
    87 \item[(1)] Implement a function that `cleans' a string by finding all
       
    88   (proper) words in the string. For this use the regular expression
       
    89   \texttt{\textbackslash{}w+} for recognising words and the library function
       
    90   \texttt{findAllIn}. The function should return a document (a list of
       
    91   strings).
       
    92   \mbox{}\hfill\mbox{[0.5 Marks]}
       
    93 
       
    94 \item[(2)] In order to compute the overlap between two documents, we
       
    95   associate each document with a \texttt{Map}. This Map represents the
       
    96   strings in a document and how many times these strings occur in the
       
    97   document. A simple (though slightly inefficient) method for counting
       
    98   the number of string-occurrences in a document is as follows: remove
       
    99   all duplicates from the document; for each of these (unique)
       
   100   strings, count how many times they occur in the original document.
       
   101   Return a Map associating strings with occurrences. For example
       
   102 
       
   103   \begin{center}
       
   104   \pcode{occurrences(List("a", "b", "b", "c", "d"))}
       
   105   \end{center}
       
   106 
       
   107   produces \pcode{Map(a -> 1, b -> 2, c -> 1, d -> 1)} and
       
   108 
       
   109   \begin{center}
       
   110   \pcode{occurrences(List("d", "b", "d", "b", "d"))}
       
   111   \end{center}
       
   112 
       
   113   produces \pcode{Map(d -> 3, b -> 2)}.\hfill[1 Mark]
       
   114 
       
   115 \item[(3)] You can think of the Maps calculated under (2) as memory-efficient
       
   116   representations of sparse ``vectors''. In this subtask you need to
       
   117   implement the \emph{product} of two such vectors, sometimes also called
       
   118   \emph{dot product} of two vectors.\footnote{\url{https://en.wikipedia.org/wiki/Dot_product}}
       
   119 
       
   120   For this dot product, implement a function that takes two documents
       
   121   (\texttt{List[String]}) as arguments. The function first calculates
       
   122   the (unique) strings in both. For each string, it multiplies the
       
   123   corresponding occurrences in each document. If a string does not
       
   124   occur in one of the documents, then the product for this string is zero. At the end
       
   125   you need to add up all products. For the two documents in (2) the dot
       
   126   product is 7, because
       
   127 
       
   128   \[
       
   129     \underbrace{1 * 0}_{"a"} \;\;+\;\;
       
   130     \underbrace{2 * 2}_{"b"} \;\;+\;\;
       
   131     \underbrace{1 * 0}_{"c"} \;\;+\;\;
       
   132     \underbrace{1 * 3}_{"d"} \qquad = 7
       
   133   \]  
       
   134   
       
   135   \hfill\mbox{[1 Mark]}
       
   136 
       
   137 \item[(4)] Implement first a function that calculates the overlap
       
   138   between two documents, say $d_1$ and $d_2$, according to the formula
       
   139 
       
   140   \[
       
   141   \texttt{overlap}(d_1, d_2) = \frac{d_1 \cdot d_2}{max(d_1^2, d_2^2)}  
       
   142   \]
       
   143 
       
   144   where $d_1^2$ means $d_1 \cdot d_1$ and so on.
       
   145   You can expect this function to return a \texttt{Double} between 0 and 1. The
       
   146   overlap between the lists in (2) is $0.5384615384615384$.
       
   147 
       
   148   Second, implement a function that calculates the similarity of
       
   149   two strings, by first extracting the substrings using the clean
       
   150   function from (1)
       
   151   and then calculating the overlap of the resulting documents.\\
       
   152   \mbox{}\hfill\mbox{[0.5 Marks]}
       
   153 \end{itemize}
       
   154 
       
   155 
       
   156 \end{document} 
       
   157 
       
   158 %%% Local Variables: 
       
   159 %%% mode: latex
       
   160 %%% TeX-master: t
       
   161 %%% End: