handouts/ho01.tex
changeset 622 b47e140bcccd
parent 621 cf287db8dc15
child 690 8d57433c7b5e
equal deleted inserted replaced
621:cf287db8dc15 622:b47e140bcccd
    34 
    34 
    35 % compiler explorer
    35 % compiler explorer
    36 % https://gcc.godbolt.org
    36 % https://gcc.godbolt.org
    37 
    37 
    38 %https://www.youtube.com/watch?v=gmhMQfFQu20
    38 %https://www.youtube.com/watch?v=gmhMQfFQu20
       
    39 
       
    40 
    39 \begin{document}
    41 \begin{document}
    40 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017, 2018, 2019}
    42 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017, 2018, 2019}
    41 
    43 
    42 \section*{Handout 1}
    44 \section*{Handout 1}
    43 
    45 
    44 The first part of this module is about text processing, be it for
    46 The first part of this module is about text processing, be it for
    45 web-crawlers, compilers, dictionaries, DNA-data, ad filters and so on.
    47 web-crawlers, compilers, dictionaries, DNA-data, spam-filters and so on.
    46 When looking for a particular string, say \pcode{"foobar"}, in a large
    48 When looking for a particular string, say \pcode{"foobar"}, in a large
    47 text we can use the Knuth-Morris-Pratt algorithm, which is currently the
    49 text we can use the Knuth-Morris-Pratt algorithm, which is currently the
    48 most efficient general string search algorithm. But often we do
    50 most efficient general string search algorithm. But often we do
    49 \emph{not} just look for a particular string, but for string patterns.
    51 \emph{not} just look for a particular string, but for string patterns.
    50 For example, in program code we need to identify what are the keywords
    52 For example, in program code we need to identify what are the keywords
    57 %constraints programming languages impose about numbers: for example
    59 %constraints programming languages impose about numbers: for example
    58 %123 in JSON is OK, but 0123 is not.
    60 %123 in JSON is OK, but 0123 is not.
    59 
    61 
    60 Often we also face the problem that we are given a string, for example
    62 Often we also face the problem that we are given a string, for example
    61 some user input, and we want to know whether it matches a particular
    63 some user input, and we want to know whether it matches a particular
    62 pattern---be it an email address, for example. In this way we can
    64 pattern---is it an email address, for example. In this way we can
    63 exclude user input that would otherwise have nasty effects on our
    65 exclude user input that would otherwise have nasty effects on our
    64 program (crashing it or making it go into an infinite loop, if not
    66 program (crashing it or making it go into an infinite loop, if not
    65 worse). This kind of ``inspecting'' mechanism is implemented in popular
    67 worse). This kind of ``inspecting'' mechanism is also implemented in
    66 network security tools such as Snort and
    68 popular network security tools such as Snort and
    67 Bro.\footnote{\url{www.snort.org}, \url{www.bro.org}} They scan incoming
    69 Bro.\footnote{\url{www.snort.org}, \url{www.bro.org}} They scan incoming
    68 network traffic for computer viruses or malicious packets. Also
    70 network traffic for computer viruses or malicious packets. Similarly
    69 filtering out spam usually involves scanning for some signature
    71 filtering out spam usually involves looking for some signature
    70 (essentially a string pattern). The point is that the fast
    72 (essentially a string pattern). The point is that the fast
    71 Knuth-Morris-Pratt algorithm for strings is not good enough for such
    73 Knuth-Morris-Pratt algorithm for strings is not good enough for such
    72 string \emph{patterns}.\smallskip
    74 string \emph{patterns}.\smallskip
    73 
    75 
    74 \defn{Regular expressions} help with conveniently specifying
    76 \defn{Regular expressions} help with conveniently specifying
   197 
   199 
   198 Regular expressions were introduced by Kleene in the 1950ies and they
   200 Regular expressions were introduced by Kleene in the 1950ies and they
   199 have been object of intense study since then. They are nowadays pretty
   201 have been object of intense study since then. They are nowadays pretty
   200 much ubiquitous in computer science. There are many libraries
   202 much ubiquitous in computer science. There are many libraries
   201 implementing regular expressions. I am sure you have come across them
   203 implementing regular expressions. I am sure you have come across them
   202 before (remember the PRA module?). 
   204 before (remember the PRA or PEP modules?). 
   203 
   205 
   204 Why on earth then is there any interest in studying them again in depth
   206 Why on earth then is there any interest in studying them again in depth
   205 in this module? Well, one answer is in the following two graphs about
   207 in this module? Well, one answer is in the following two graphs about
   206 regular expression matching in Python, Ruby, JavaScript and Java
   208 regular expression matching in Python, Ruby, JavaScript and Java
   207 (Version 8).
   209 (Version 8).
   321   hackers can look for these instances where the matching engine behaves
   323   hackers can look for these instances where the matching engine behaves
   322   badly and then mount a nice DoS-attack against your application. These
   324   badly and then mount a nice DoS-attack against your application. These
   323   attacks are already have their own name: \emph{Regular Expression
   325   attacks are already have their own name: \emph{Regular Expression
   324   Denial of Service Attacks (ReDoS)}.
   326   Denial of Service Attacks (ReDoS)}.
   325 
   327 
   326 It will be instructive to look behind the ``scenes'' to find
   328 It will be instructive to look behind the ``scenes'' to find out why
   327 out why Python and Ruby (and others) behave so badly when
   329 Python and Ruby (and others) behave so badly when matching strings with
   328 matching strings with evil regular expressions. But we will also look
   330 evil regular expressions. But we will also look at a relatively simple
   329 at a relatively simple algorithm that solves this problem much
   331 algorithm that solves this problem much better than Python and Ruby
   330 better than Python and Ruby do\ldots actually it will be two
   332 do\ldots actually it will be two versions of the algorithm: the first
   331 versions of the algorithm: the first one will be able to
   333 one will be able in  the example \texttt{a?\{n\}\,a\{n\}} to process strings of
   332 process strings of approximately 1,100 \texttt{a}s in 23 
   334 approximately 1,100 \texttt{a}s in 23 seconds, while the second version
   333 seconds, while the second version will even be able to process
   335 will even be able to process up to 11,000(!) in 5 seconds, see the graph
   334 up to 11,000(!) in 5 seconds, see the graph below:
   336 below:
   335 
   337 
   336 \begin{center}
   338 \begin{center}
   337 \begin{tikzpicture}
   339 \begin{tikzpicture}
   338   \begin{axis}[
   340   \begin{axis}[
   339     title={Graph: $\texttt{a?\{n\}\,a{\{n\}}}$ and strings 
   341     title={Graph: $\texttt{a?\{n\}\,a{\{n\}}}$ and strings 
   347     ymax=32,
   349     ymax=32,
   348     ytick={0,5,...,30},
   350     ytick={0,5,...,30},
   349     scaled ticks=false,
   351     scaled ticks=false,
   350     axis lines=left,
   352     axis lines=left,
   351     width=7cm,
   353     width=7cm,
   352     height=4.5cm,
   354     height=4.4cm,
   353     legend entries={Our Algorithm V1, Our Algorithm V2},
   355     legend entries={Our Algorithm V1, Our Algorithm V2},
   354     legend pos=outer north east]
   356     legend pos=outer north east]
   355 \addplot[green,mark=square*,mark options={fill=white}] table {re2.data};
   357 \addplot[green,mark=square*,mark options={fill=white}] table {re2.data};
   356 \addplot[black,mark=square*,mark options={fill=white}] table {re3.data};
   358 \addplot[black,mark=square*,mark options={fill=white}] table {re3.data};
   357 \end{axis}
   359 \end{axis}