34 |
34 |
35 % compiler explorer |
35 % compiler explorer |
36 % https://gcc.godbolt.org |
36 % https://gcc.godbolt.org |
37 |
37 |
38 %https://www.youtube.com/watch?v=gmhMQfFQu20 |
38 %https://www.youtube.com/watch?v=gmhMQfFQu20 |
|
39 |
|
40 |
39 \begin{document} |
41 \begin{document} |
40 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017, 2018, 2019} |
42 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017, 2018, 2019} |
41 |
43 |
42 \section*{Handout 1} |
44 \section*{Handout 1} |
43 |
45 |
44 The first part of this module is about text processing, be it for |
46 The first part of this module is about text processing, be it for |
45 web-crawlers, compilers, dictionaries, DNA-data, ad filters and so on. |
47 web-crawlers, compilers, dictionaries, DNA-data, spam-filters and so on. |
46 When looking for a particular string, say \pcode{"foobar"}, in a large |
48 When looking for a particular string, say \pcode{"foobar"}, in a large |
47 text we can use the Knuth-Morris-Pratt algorithm, which is currently the |
49 text we can use the Knuth-Morris-Pratt algorithm, which is currently the |
48 most efficient general string search algorithm. But often we do |
50 most efficient general string search algorithm. But often we do |
49 \emph{not} just look for a particular string, but for string patterns. |
51 \emph{not} just look for a particular string, but for string patterns. |
50 For example, in program code we need to identify what are the keywords |
52 For example, in program code we need to identify what are the keywords |
57 %constraints programming languages impose about numbers: for example |
59 %constraints programming languages impose about numbers: for example |
58 %123 in JSON is OK, but 0123 is not. |
60 %123 in JSON is OK, but 0123 is not. |
59 |
61 |
60 Often we also face the problem that we are given a string, for example |
62 Often we also face the problem that we are given a string, for example |
61 some user input, and we want to know whether it matches a particular |
63 some user input, and we want to know whether it matches a particular |
62 pattern---be it an email address, for example. In this way we can |
64 pattern---is it an email address, for example. In this way we can |
63 exclude user input that would otherwise have nasty effects on our |
65 exclude user input that would otherwise have nasty effects on our |
64 program (crashing it or making it go into an infinite loop, if not |
66 program (crashing it or making it go into an infinite loop, if not |
65 worse). This kind of ``inspecting'' mechanism is implemented in popular |
67 worse). This kind of ``inspecting'' mechanism is also implemented in |
66 network security tools such as Snort and |
68 popular network security tools such as Snort and |
67 Bro.\footnote{\url{www.snort.org}, \url{www.bro.org}} They scan incoming |
69 Bro.\footnote{\url{www.snort.org}, \url{www.bro.org}} They scan incoming |
68 network traffic for computer viruses or malicious packets. Also |
70 network traffic for computer viruses or malicious packets. Similarly |
69 filtering out spam usually involves scanning for some signature |
71 filtering out spam usually involves looking for some signature |
70 (essentially a string pattern). The point is that the fast |
72 (essentially a string pattern). The point is that the fast |
71 Knuth-Morris-Pratt algorithm for strings is not good enough for such |
73 Knuth-Morris-Pratt algorithm for strings is not good enough for such |
72 string \emph{patterns}.\smallskip |
74 string \emph{patterns}.\smallskip |
73 |
75 |
74 \defn{Regular expressions} help with conveniently specifying |
76 \defn{Regular expressions} help with conveniently specifying |
197 |
199 |
198 Regular expressions were introduced by Kleene in the 1950ies and they |
200 Regular expressions were introduced by Kleene in the 1950ies and they |
199 have been object of intense study since then. They are nowadays pretty |
201 have been object of intense study since then. They are nowadays pretty |
200 much ubiquitous in computer science. There are many libraries |
202 much ubiquitous in computer science. There are many libraries |
201 implementing regular expressions. I am sure you have come across them |
203 implementing regular expressions. I am sure you have come across them |
202 before (remember the PRA module?). |
204 before (remember the PRA or PEP modules?). |
203 |
205 |
204 Why on earth then is there any interest in studying them again in depth |
206 Why on earth then is there any interest in studying them again in depth |
205 in this module? Well, one answer is in the following two graphs about |
207 in this module? Well, one answer is in the following two graphs about |
206 regular expression matching in Python, Ruby, JavaScript and Java |
208 regular expression matching in Python, Ruby, JavaScript and Java |
207 (Version 8). |
209 (Version 8). |
321 hackers can look for these instances where the matching engine behaves |
323 hackers can look for these instances where the matching engine behaves |
322 badly and then mount a nice DoS-attack against your application. These |
324 badly and then mount a nice DoS-attack against your application. These |
323 attacks are already have their own name: \emph{Regular Expression |
325 attacks are already have their own name: \emph{Regular Expression |
324 Denial of Service Attacks (ReDoS)}. |
326 Denial of Service Attacks (ReDoS)}. |
325 |
327 |
326 It will be instructive to look behind the ``scenes'' to find |
328 It will be instructive to look behind the ``scenes'' to find out why |
327 out why Python and Ruby (and others) behave so badly when |
329 Python and Ruby (and others) behave so badly when matching strings with |
328 matching strings with evil regular expressions. But we will also look |
330 evil regular expressions. But we will also look at a relatively simple |
329 at a relatively simple algorithm that solves this problem much |
331 algorithm that solves this problem much better than Python and Ruby |
330 better than Python and Ruby do\ldots actually it will be two |
332 do\ldots actually it will be two versions of the algorithm: the first |
331 versions of the algorithm: the first one will be able to |
333 one will be able in the example \texttt{a?\{n\}\,a\{n\}} to process strings of |
332 process strings of approximately 1,100 \texttt{a}s in 23 |
334 approximately 1,100 \texttt{a}s in 23 seconds, while the second version |
333 seconds, while the second version will even be able to process |
335 will even be able to process up to 11,000(!) in 5 seconds, see the graph |
334 up to 11,000(!) in 5 seconds, see the graph below: |
336 below: |
335 |
337 |
336 \begin{center} |
338 \begin{center} |
337 \begin{tikzpicture} |
339 \begin{tikzpicture} |
338 \begin{axis}[ |
340 \begin{axis}[ |
339 title={Graph: $\texttt{a?\{n\}\,a{\{n\}}}$ and strings |
341 title={Graph: $\texttt{a?\{n\}\,a{\{n\}}}$ and strings |
347 ymax=32, |
349 ymax=32, |
348 ytick={0,5,...,30}, |
350 ytick={0,5,...,30}, |
349 scaled ticks=false, |
351 scaled ticks=false, |
350 axis lines=left, |
352 axis lines=left, |
351 width=7cm, |
353 width=7cm, |
352 height=4.5cm, |
354 height=4.4cm, |
353 legend entries={Our Algorithm V1, Our Algorithm V2}, |
355 legend entries={Our Algorithm V1, Our Algorithm V2}, |
354 legend pos=outer north east] |
356 legend pos=outer north east] |
355 \addplot[green,mark=square*,mark options={fill=white}] table {re2.data}; |
357 \addplot[green,mark=square*,mark options={fill=white}] table {re2.data}; |
356 \addplot[black,mark=square*,mark options={fill=white}] table {re3.data}; |
358 \addplot[black,mark=square*,mark options={fill=white}] table {re3.data}; |
357 \end{axis} |
359 \end{axis} |