// Part 1 about Code Similarity
//==============================

//(1) Complete the clean function below. It should find
//    all words in a string using the regular expression
//    \w+ and the library function
//
//         some_regex.findAllIn(some_string)
//
//    The words should be returned as a list of strings.


//def clean(s: String) : List[String] = ...
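
// A minimal sketch of (1), not necessarily the intended solution: the
// regular expression \w+ is compiled with .r, and findAllIn returns an
// iterator over the matches, which is then turned into a list.
def clean(s: String) : List[String] =
  """\w+""".r.findAllIn(s).toList

// For example, clean("eco-tourism in Australia's") gives
// List("eco", "tourism", "in", "Australia", "s"), because - and ' are
// not matched by \w.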



//(2) The function occurrences calculates the number of times
//    each string occurs in a list of strings. These occurrences
//    should be returned as a Map from strings to integers.


//def occurrences(xs: List[String]): Map[String, Int] = ..
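
// One possible way to fill in (2), offered as a sketch: groupBy(identity)
// collects equal strings into sub-lists, and the size of each sub-list
// is the number of occurrences of that string.
def occurrences(xs: List[String]): Map[String, Int] =
  xs.groupBy(identity).map { case (w, ws) => w -> ws.size }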


//(3) This function calculates the dot-product of two documents
//    (lists of strings). For this it calculates the occurrence
//    maps from (2) and then multiplies the corresponding occurrences.
//    If a string does not occur in a document, the product is zero.
//    The function finally sums up all products.


//def prod(lst1: List[String], lst2: List[String]) : Int = ..
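
// A sketch of (3). So that it compiles on its own, the occurrence maps
// are built by a local helper (the hypothetical name counts), which does
// the same job as occurrences from (2).
def prod(lst1: List[String], lst2: List[String]) : Int = {
  def counts(xs: List[String]): Map[String, Int] =
    xs.groupBy(identity).map { case (w, ws) => w -> ws.size }
  val occ1 = counts(lst1)
  val occ2 = counts(lst2)
  // a word missing from lst2 contributes n * 0 = 0 to the sum;
  // words only in lst2 never appear in occ1 and so also contribute 0
  occ1.toList.map { case (w, n) => n * occ2.getOrElse(w, 0) }.sum
}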


//(4) Complete the functions overlap and similarity. The overlap of
//    two documents is calculated by the formula given in the assignment
//    description. The similarity of two strings is given by the overlap
//    of the cleaned strings (see (1)).


//def overlap(lst1: List[String], lst2: List[String]) : Double = ...
//def similarity(s1: String, s2: String) : Double = ...
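
// A sketch of (4). The formula is not restated in this file, so the one
// used here is an assumption:
//
//     overlap(d1, d2) = d1 . d2 / max(d1 . d1, d2 . d2)
//
// where . is the dot-product from (3). It agrees with the expected values
// in the test cases below (for example 7 / max(7, 13) = 0.538...). To keep
// the sketch self-contained, the dot-product and the word extraction are
// written out locally (dot and words are hypothetical names) instead of
// calling prod and clean.
def overlap(lst1: List[String], lst2: List[String]) : Double = {
  def dot(xs: List[String], ys: List[String]): Int = {
    val cx = xs.groupBy(identity).map { case (w, ws) => w -> ws.size }
    val cy = ys.groupBy(identity).map { case (w, ws) => w -> ws.size }
    cx.toList.map { case (w, n) => n * cy.getOrElse(w, 0) }.sum
  }
  // note: two empty documents give 0.0 / 0, which is NaN
  dot(lst1, lst2).toDouble / math.max(dot(lst1, lst1), dot(lst2, lst2))
}

def similarity(s1: String, s2: String) : Double = {
  def words(s: String): List[String] = """\w+""".r.findAllIn(s).toList
  overlap(words(s1), words(s2))
}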




/* Test cases


val list1 = List("a", "b", "b", "c", "d")
val list2 = List("d", "b", "d", "b", "d")

occurrences(List("a", "b", "b", "c", "d"))   // Map(a -> 1, b -> 2, c -> 1, d -> 1)
occurrences(List("d", "b", "d", "b", "d"))   // Map(d -> 3, b -> 2)

prod(list1, list2)      // 7

overlap(list1, list2)   // 0.5384615384615384
overlap(list2, list1)   // 0.5384615384615384
overlap(list1, list1)   // 1.0
overlap(list2, list2)   // 1.0
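
// Worked check of the values above. The overlap formula is not restated
// in this file; the check assumes it is
//     overlap(d1, d2) = prod(d1, d2) / max(prod(d1, d1), prod(d2, d2))
// which reproduces the expected results:
//   prod(list1, list2) = 2*2 (for "b") + 1*3 (for "d")   = 7
//   prod(list1, list1) = 1*1 + 2*2 + 1*1 + 1*1           = 7
//   prod(list2, list2) = 3*3 + 2*2                       = 13
//   overlap(list1, list2) = 7 / max(7, 13) = 7 / 13, approx. 0.538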

// Plagiarism examples from
// https://desales.libguides.com/avoidingplagiarism/examples

val orig1 = """There is a strong market demand for eco-tourism in
Australia. Its rich and diverse natural heritage ensures Australia's
capacity to attract international ecotourists and gives Australia a
comparative advantage in the highly competitive tourism industry."""

val plag1 = """There is a high market demand for eco-tourism in
Australia. Australia has a comparative advantage in the highly
competitive tourism industry due to its rich and varied natural
heritage which ensures Australia's capacity to attract international
ecotourists."""

similarity(orig1, plag1) // 0.8679245283018868


// Plagiarism examples from
// https://www.utc.edu/library/help/tutorials/plagiarism/examples-of-plagiarism.php

val orig2 = """No oil spill is entirely benign. Depending on timing and
location, even a relatively minor spill can cause significant harm to
individual organisms and entire populations. Oil spills can cause
impacts over a range of time scales, from days to years, or even
decades for certain spills. Impacts are typically divided into acute
(short-term) and chronic (long-term) effects. Both types are part of a
complicated and often controversial equation that is addressed after
an oil spill: ecosystem recovery."""

val plag2 = """There is no such thing as a "good" oil spill. If the
time and place are just right, even a small oil spill can cause damage
to sensitive ecosystems. Further, spills can cause harm days, months,
years, or even decades after they occur. Because of this, spills are
usually broken into short-term (acute) and long-term (chronic)
effects. Both of these types of harm must be addressed in ecosystem
recovery: a controversial tactic that is often implemented immediately
following an oil spill."""

overlap(clean(orig2), clean(plag2))  // 0.728
similarity(orig2, plag2)             // 0.728



// The punchline: any similarity above 0.6 looks suspicious and
// should be investigated by staff.

*/