cws/pre_cw02.tex
changeset 346 663c2a9108d1
parent 333 24bc76d97db2
child 355 bc3980949af2
equal deleted inserted replaced
345:40657f9a4e4a 346:663c2a9108d1
       
     1 % !TEX program = xelatex
       
     2 \documentclass{article}
       
     3 \usepackage{../style}
       
     4 \usepackage{disclaimer}
       
     5 \usepackage{../langs}
       
     6 
       
     7 \begin{document}
       
     8 
       
     9 
       
    10 %% should ask to lower case the words.
       
    11 
       
    12 \section*{Preliminary Part 7 (Scala, 3 Marks)}
       
    13 
       
    14 \mbox{}\hfill\textit{``What one programmer can do in one month,}\\
       
    15 \mbox{}\hfill\textit{two programmers can do in two months.''}\smallskip\\
       
    16 \mbox{}\hfill\textit{ --- Frederick P.~Brooks (author of The Mythical Man-Month)}\bigskip\medskip
       
    17 
       
    18 \IMPORTANT{You are asked to implement a Scala program for measuring similarity in
       
    19   texts. The preliminary part is due on \cwSEVEN{} at 5pm and worth 3\%.}
       
    20 
       
    21 \noindent
       
    22 Also note that the running time of each part will be restricted to a
       
    23 maximum of 30 seconds on my laptop.
       
    24 
       
    25 \DISCLAIMER{}
       
    26 
       
    27 
       
    28 \subsection*{Reference Implementation}
       
    29 
       
    30 Like the C++ part, the Scala part works like this: you
       
    31 push your files to GitHub and receive (after sometimes a long delay) some
       
    32 automated feedback. In the end we will take a snapshot of the submitted files and
       
    33 apply an automated marking script to them.\medskip
       
    34 
       
    35 \noindent
       
    36 In addition, the Scala part comes with reference
       
    37 implementations in form of \texttt{jar}-files. This allows you to run
       
    38 any test cases on your own computer. For example you can call Scala on
       
    39 the command line with the option \texttt{-cp docdiff.jar} and then
       
    40 query any function from the template file. Say you want to find out
       
    41 what the function \texttt{occurrences} produces: for this you just need
       
    42 to prefix it with the object name \texttt{CW7a}.  If you want to find out what
       
    43 these functions produce for the list \texttt{List("a", "b", "b")},
       
    44 you would type something like:
       
    45 
       
    46 \begin{lstlisting}[language={},numbers=none,basicstyle=\ttfamily\small]
       
    47 $ scala -cp docdiff.jar
       
    48   
       
    49 scala> CW7a.occurrences(List("a", "b", "b"))
       
    50 ...
       
    51 \end{lstlisting}%$
       
    52 
       
    53 \subsection*{Hints}
       
    54 
       
    55 \noindent
       
    56 \textbf{For the Preliminary Part:} useful operations involving regular
       
    57 expressions:
       
    58 \[\texttt{reg.findAllIn(s).toList}\]
       
    59 \noindent finds all
       
    60 substrings in \texttt{s} according to a regular regular expression
       
    61 \texttt{reg}; useful list operations: \texttt{.distinct}
       
    62 removing duplicates from a list, \texttt{.count} counts the number of
       
    63 elements in a list that satisfy some condition, \texttt{.toMap}
       
    64 transfers a list of pairs into a Map, \texttt{.sum} adds up a list of
       
    65 integers, \texttt{.max} calculates the maximum of a list.\bigskip
       
    66 
       
    67 
       
    68 
       
    69 \newpage
       
    70 \subsection*{Preliminary Part (3 Marks, file docdiff.scala)}
       
    71 
       
    72 It seems source code plagiarism---stealing and submitting someone
       
    73 else's code---is a serious problem at other
       
    74 universities.\footnote{Surely, King's students, after all their
       
    75   instructions and warnings, would never commit such an offence. Yes?}
       
    76 Detecting such plagiarism is time-consuming and disheartening for
       
    77 lecturers at those universities. To aid these poor souls, let's
       
    78 implement in this part a program that determines the similarity
       
    79 between two documents (be they source code or texts in English). A
       
    80 document will be represented as a list of strings.
       
    81 
       
    82 
       
    83 \subsection*{Tasks}
       
    84 
       
    85 \begin{itemize}
       
    86 \item[(1)] Implement a function that `cleans' a string by finding all
       
    87   (proper) words in this string. For this use the regular expression
       
    88   \texttt{\textbackslash{}w+} for recognising words and the library function
       
    89   \texttt{findAllIn}. The function should return a document (a list of
       
    90   strings).\\
       
    91   \mbox{}\hfill [0.5 Marks]
       
    92 
       
    93 \item[(2)] In order to compute the overlap between two documents, we
       
    94   associate each document with a \texttt{Map}. This Map represents the
       
    95   strings in a document and how many times these strings occur in the
       
    96   document. A simple (though slightly inefficient) method for counting
       
    97   the number of string-occurrences in a document is as follows: remove
       
    98   all duplicates from the document; for each of these (unique)
       
    99   strings, count how many times they occur in the original document.
       
   100   Return a Map associating strings with occurrences. For example
       
   101 
       
   102   \begin{center}
       
   103   \pcode{occurrences(List("a", "b", "b", "c", "d"))}
       
   104   \end{center}
       
   105 
       
   106   produces \pcode{Map(a -> 1, b -> 2, c -> 1, d -> 1)} and
       
   107 
       
   108   \begin{center}
       
   109   \pcode{occurrences(List("d", "b", "d", "b", "d"))}
       
   110   \end{center}
       
   111 
       
   112   produces \pcode{Map(d -> 3, b -> 2)}.\hfill[1 Mark]
       
   113 
       
   114 \item[(3)] You can think of the Maps calculated under (2) as memory-efficient
       
   115   representations of sparse ``vectors''. In this subtask you need to
       
   116   implement the \emph{product} of two such vectors, sometimes also called
       
   117   \emph{dot product} of two vectors.\footnote{\url{https://en.wikipedia.org/wiki/Dot_product}}
       
   118 
       
   119   For this dot product, implement a function that takes two documents
       
   120   (\texttt{List[String]}) as arguments. The function first calculates
       
   121   the (unique) strings in both. For each string, it multiplies the
       
   122   corresponding occurrences in each document. If a string does not
       
   123   occur in one of the documents, then the product for this string is zero. At the end
       
   124   you need to add up all products. For the two documents in (2) the dot
       
   125   product is 7, because
       
   126 
       
   127   \[
       
   128     \underbrace{1 * 0}_{"a"} \;\;+\;\;
       
   129     \underbrace{2 * 2}_{"b"} \;\;+\;\;
       
   130     \underbrace{1 * 0}_{"c"} \;\;+\;\;
       
   131     \underbrace{1 * 3}_{"d"} \qquad = 7
       
   132   \]  
       
   133   
       
   134   \hfill\mbox{[1 Mark]}
       
   135 
       
   136 \item[(4)] Implement first a function that calculates the overlap
       
   137   between two documents, say $d_1$ and $d_2$, according to the formula
       
   138 
       
   139   \[
       
   140   \texttt{overlap}(d_1, d_2) = \frac{d_1 \cdot d_2}{max(d_1^2, d_2^2)}  
       
   141   \]
       
   142 
       
   143   where $d_1^2$ means $d_1 \cdot d_1$ and so on.
       
   144   You can expect this function to return a \texttt{Double} between 0 and 1. The
       
   145   overlap between the lists in (2) is $0.5384615384615384$.
       
   146 
       
   147   Second, implement a function that calculates the similarity of
       
   148   two strings, by first extracting the substrings using the clean
       
   149   function from (1)
       
   150   and then calculating the overlap of the resulting documents.\\
       
   151   \mbox{}\hfill\mbox{[0.5 Marks]}
       
   152 \end{itemize}
       
   153 
       
   154 
       
   155 \end{document} 
       
   156 
       
   157 %%% Local Variables: 
       
   158 %%% mode: latex
       
   159 %%% TeX-master: t
       
   160 %%% End: