# HG changeset patch
# User Christian Urban
# Date 1564061977 -3600
# Node ID b47e140bcccdd2b95186c7253a6a8be0d82c754a
# Parent cf287db8dc15a55320d1ba5044243428883b500d
updated

diff -r cf287db8dc15 -r b47e140bcccd handouts/ho01.pdf
Binary file handouts/ho01.pdf has changed
diff -r cf287db8dc15 -r b47e140bcccd handouts/ho01.tex
--- a/handouts/ho01.tex Wed Jul 17 11:17:52 2019 +0100
+++ b/handouts/ho01.tex Thu Jul 25 14:39:37 2019 +0100
@@ -36,13 +36,15 @@
 % https://gcc.godbolt.org
 %https://www.youtube.com/watch?v=gmhMQfFQu20

+
+
 \begin{document}
 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017, 2018, 2019}

 \section*{Handout 1}

 The first part of this module is about text processing, be it for
-web-crawlers, compilers, dictionaries, DNA-data, ad filters and so on.
+web-crawlers, compilers, dictionaries, DNA-data, spam-filters and so on.
 When looking for a particular string, say \pcode{"foobar"}, in a large
 text we can use the Knuth-Morris-Pratt algorithm, which is currently
 the most efficient general string search algorithm. But often we do
@@ -59,14 +61,14 @@

 Often we also face the problem that we are given a string, for example
 some user input, and we want to know whether it matches a particular
-pattern---be it an email address, for example. In this way we can
+pattern---is it an email address, for example. In this way we can
 exclude user input that would otherwise have nasty effects on our
 program (crashing it or making it go into an infinite loop, if not
-worse). This kind of ``inspecting'' mechanism is implemented in popular
-network security tools such as Snort and
+worse). This kind of ``inspecting'' mechanism is also implemented in
+popular network security tools such as Snort and
 Bro.\footnote{\url{www.snort.org}, \url{www.bro.org}} They scan incoming
-network traffic for computer viruses or malicious packets. Also
-filtering out spam usually involves scanning for some signature
+network traffic for computer viruses or malicious packets. Similarly,
+filtering out spam usually involves looking for some signature
 (essentially a string pattern). The point is that the fast
 Knuth-Morris-Pratt algorithm for strings is not good enough for such
 string \emph{patterns}.\smallskip
@@ -199,7 +201,7 @@
 have been object of intense study since then. They are nowadays
 pretty much ubiquitous in computer science. There are many libraries
 implementing regular expressions. I am sure you have come across them
-before (remember the PRA module?).
+before (remember the PRA or PEP modules?).

 Why on earth then is there any interest in studying them again in depth
 in this module? Well, one answer is in the following two graphs about
@@ -323,15 +325,15 @@
 attacks are already have their own name: \emph{Regular Expression
 Denial of Service Attacks (ReDoS)}.

-It will be instructive to look behind the ``scenes'' to find
-out why Python and Ruby (and others) behave so badly when
-matching strings with evil regular expressions. But we will also look
-at a relatively simple algorithm that solves this problem much
-better than Python and Ruby do\ldots actually it will be two
-versions of the algorithm: the first one will be able to
-process strings of approximately 1,100 \texttt{a}s in 23
-seconds, while the second version will even be able to process
-up to 11,000(!) in 5 seconds, see the graph below:
+It will be instructive to look behind the ``scenes'' to find out why
+Python and Ruby (and others) behave so badly when matching strings with
+evil regular expressions. But we will also look at a relatively simple
+algorithm that solves this problem much better than Python and Ruby
+do\ldots actually it will be two versions of the algorithm: the first
+one will be able, in the example \texttt{a?\{n\}\,a\{n\}}, to process strings
+of approximately 1,100 \texttt{a}s in 23 seconds, while the second version
+will even be able to process up to 11,000(!) \texttt{a}s in 5 seconds; see
+the graph below:

 \begin{center}
 \begin{tikzpicture}
@@ -349,7 +351,7 @@
     scaled ticks=false,
     axis lines=left,
     width=7cm,
-    height=4.5cm,
+    height=4.4cm,
     legend entries={Our Algorithm V1, Our Algorithm V2},
     legend pos=outer north east]
 \addplot[green,mark=square*,mark options={fill=white}] table {re2.data};