34 % compiler explorer |
34 % compiler explorer |
35 % https://gcc.godbolt.org |
35 % https://gcc.godbolt.org |
36 |
36 |
37 %https://www.youtube.com/watch?v=gmhMQfFQu20 |
37 %https://www.youtube.com/watch?v=gmhMQfFQu20 |
38 \begin{document} |
38 \begin{document} |
39 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017} |
39 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017, 2018} |
40 |
40 |
41 \section*{Handout 1} |
41 \section*{Handout 1} |
42 |
42 |
43 This module is about text processing, be it for web-crawlers, |
43 This module is about text processing, be it for web-crawlers, |
44 compilers, dictionaries, DNA-data, ad filters and so on. When looking for a |
44 compilers, dictionaries, DNA-data, ad filters and so on. When looking for a |
45 particular string, like $abc$ in a large text we can use the |
45 particular string, like $abc$ in a large text we can use the |
46 Knuth-Morris-Pratt algorithm, which is currently the most efficient |
46 Knuth-Morris-Pratt algorithm, which is currently the most efficient |
47 general string search algorithm. But often we do \emph{not} just look |
47 general string search algorithm. But often we do \emph{not} just look |
48 for a particular string, but for string patterns. For example, in |
48 for a particular string, but for string patterns. For example, in |
49 program code we need to identify what are the keywords (if, then, |
49 program code we need to identify what are the keywords (\texttt{if}, \texttt{then}, |
50 while, for, etc), what are the identifiers (variable names). A pattern for |
50 \texttt{while}, \texttt{for}, etc), what are the identifiers (variable names). A pattern for |
51 identifiers could be stated as: they start with a letter, followed by |
51 identifiers could be stated as: they start with a letter, followed by |
52 zero or more letters, numbers and underscores. Often we also face the |
52 zero or more letters, numbers and underscores. Often we also face the |
53 problem that we are given a string (for example some user input) and |
53 problem that we are given a string (for example some user input) and |
54 want to know whether it matches a particular pattern---be it an email |
54 want to know whether it matches a particular pattern---be it an email |
55 address, for example. In this way we can exclude user input that would |
55 address, for example. In this way we can exclude user input that would |
104 \end{lstlisting} |
104 \end{lstlisting} |
105 |
105 |
106 \noindent according to the regular expression we specified in line |
106 \noindent according to the regular expression we specified in line |
107 \eqref{email} above. Whether this is intended or not is a different |
107 \eqref{email} above. Whether this is intended or not is a different |
108 question (the second email above is actually an acceptable email |
108 question (the second email above is actually an acceptable email |
109 address acording to the RFC 5322 standard for email addresses). |
109 address according to the RFC 5322 standard for email addresses). |
110 |
110 |
111 As mentioned above, identifiers, or variables, in program code |
111 As mentioned above, identifiers, or variables, in program code |
112 are often required to satisfy the constraints that they start |
112 are often required to satisfy the constraints that they start |
113 with a letter and then can be followed by zero or more letters |
113 with a letter and then can be followed by zero or more letters |
114 or numbers and also can include underscores, but not as the |
114 or numbers and also can include underscores, but not as the |
138 \pcode{re?} & matches 0 or 1 occurrence of preceding |
138 \pcode{re?} & matches 0 or 1 occurrence of preceding |
139 expression\\ |
139 expression\\ |
140 \pcode{re\{n\}} & matches exactly \pcode{n} number of |
140 \pcode{re\{n\}} & matches exactly \pcode{n} number of |
141 occurrences of preceding expression\\ |
141 occurrences of preceding expression\\ |
142 \pcode{re\{n,m\}} & matches at least \pcode{n} and at most {\tt m} |
142 \pcode{re\{n,m\}} & matches at least \pcode{n} and at most {\tt m} |
143 occurences of the preceding expression\\ |
143 occurrences of the preceding expression\\ |
144 \pcode{[...]} & matches any single character inside the |
144 \pcode{[...]} & matches any single character inside the |
145 brackets\\ |
145 brackets\\ |
146 \pcode{[^...]} & matches any single character not inside the |
146 \pcode{[^...]} & matches any single character not inside the |
147 brackets\\ |
147 brackets\\ |
148 \pcode{...-...} & character ranges\\ |
148 \pcode{...-...} & character ranges\\ |
189 much ubiquitous in computer science. There are many libraries |
189 much ubiquitous in computer science. There are many libraries |
190 implementing regular expressions. I am sure you have come across them |
190 implementing regular expressions. I am sure you have come across them |
191 before (remember the PRA module?). Why on earth then is there any |
191 before (remember the PRA module?). Why on earth then is there any |
192 interest in studying them again in depth in this module? Well, one |
192 interest in studying them again in depth in this module? Well, one |
193 answer is in the following two graphs about regular expression |
193 answer is in the following two graphs about regular expression |
194 matching in Python, Ruby and Java. |
194 matching in Python, Ruby and Java (Version 8). |
195 |
195 |
196 \begin{center} |
196 \begin{center} |
197 \begin{tabular}{@{\hspace{-1mm}}c@{\hspace{1mm}}c@{}} |
197 \begin{tabular}{@{\hspace{-1mm}}c@{\hspace{1mm}}c@{}} |
198 \begin{tikzpicture} |
198 \begin{tikzpicture} |
199 \begin{axis}[ |
199 \begin{axis}[ |
271 \begin{center} |
271 \begin{center} |
272 \url{https://vimeo.com/112065252} |
272 \url{https://vimeo.com/112065252} |
273 \end{center} |
273 \end{center} |
274 |
274 |
275 \noindent |
275 \noindent |
276 A similar problem also occured in the Atom editor: |
276 A similar problem also occurred in the Atom editor: |
277 |
277 |
278 \begin{center} |
278 \begin{center} |
279 \url{http://davidvgalbraith.com/how-i-fixed-atom/} |
279 \url{http://davidvgalbraith.com/how-i-fixed-atom/} |
280 \end{center} |
280 \end{center} |
281 |
281 |
282 \noindent |
282 \noindent |
283 Such troublesome regular expressions are sometimes called \emph{evil |
283 Such troublesome regular expressions are sometimes called \emph{evil |
284 regular expressions} because they have the potential to make regular |
284 regular expressions} because they have the potential to make regular |
285 expression matching engines to topple over, like in Python, Ruby and |
285 expression matching engines to topple over, like in Python, Ruby and |
286 Java. This ``toppling over'' is also sometimes called \emph{catastrophic |
286 Java. This ``toppling over'' is also sometimes called |
287 backtracking}. The problem with evil regular expressions is that |
287 \emph{catastrophic backtracking}. I have also seen the term |
288 they can have some serious consequences, for example, if you use them |
288 \emph{eternal matching} used for this. The problem with evil regular |
289 in your web-application. The reason is that hackers can look for these |
289 expressions is that they can have some serious consequences, for |
290 instances where the matching engine behaves badly and then mount a |
290 example, if you use them in your web-application. The reason is that |
291 nice DoS-attack against your application. These attacks are already |
291 hackers can look for these instances where the matching engine behaves |
292 have their own name: \emph{Regular Expression Denial of Service |
292 badly and then mount a nice DoS-attack against your application. These |
293 Attacks (ReDoS)}. |
293 attacks are already have their own name: \emph{Regular Expression |
|
294 Denial of Service Attacks (ReDoS)}. |
294 |
295 |
295 It will be instructive to look behind the ``scenes'' to find |
296 It will be instructive to look behind the ``scenes'' to find |
296 out why Python and Ruby (and others) behave so badly when |
297 out why Python and Ruby (and others) behave so badly when |
297 matching strings with evil regular expressions. But we will also look |
298 matching strings with evil regular expressions. But we will also look |
298 at a relatively simple algorithm that solves this problem much |
299 at a relatively simple algorithm that solves this problem much |
403 structure of regular expressions is always clear. But for |
404 structure of regular expressions is always clear. But for |
404 writing them down in a more mathematical fashion, parentheses |
405 writing them down in a more mathematical fashion, parentheses |
405 will be helpful. For example we will write $(r_1 + r_2)^*$, |
406 will be helpful. For example we will write $(r_1 + r_2)^*$, |
406 which is different from, say $r_1 + (r_2)^*$. The former means |
407 which is different from, say $r_1 + (r_2)^*$. The former means |
407 roughly zero or more times $r_1$ or $r_2$, while the latter |
408 roughly zero or more times $r_1$ or $r_2$, while the latter |
408 means $r_1$ or zero or more times $r_2$. This will turn out to |
409 means $r_1$, or zero or more times $r_2$. This will turn out to |
409 be two different patterns, which match in general different |
410 be two different patterns, which match in general different |
410 strings. We should also write $(r_1 + r_2) + r_3$, which is |
411 strings. We should also write $(r_1 + r_2) + r_3$, which is |
411 different from the regular expression $r_1 + (r_2 + r_3)$, but |
412 different from the regular expression $r_1 + (r_2 + r_3)$, but |
412 in case of $+$ and $\cdot$ we actually do not care about the |
413 in case of $+$ and $\cdot$ we actually do not care about the |
413 order and just write $r_1 + r_2 + r_3$, or $r_1 \cdot r_2 |
414 order and just write $r_1 + r_2 + r_3$, or $r_1 \cdot r_2 |
414 \cdot r_3$, respectively. The reasons for this will become |
415 \cdot r_3$, respectively. The reasons for this carelessness will become |
415 clear shortly. |
416 clear shortly. |
416 |
417 |
417 In the literature you will often find that the choice $r_1 + |
418 In the literature you will often find that the choice $r_1 + |
418 r_2$ is written as $r_1\mid{}r_2$ or $r_1\mid\mid{}r_2$. Also, |
419 r_2$ is written as $r_1\mid{}r_2$ or $r_1\mid\mid{}r_2$. Also, |
419 often our $\ZERO$ and $\ONE$ are written $\varnothing$ and |
420 often our $\ZERO$ and $\ONE$ are written $\varnothing$ and |
420 $\epsilon$, respectively. Following the convention in the |
421 $\epsilon$, respectively. Following the convention in the |
421 literature, we will often omit the $\cdot$ all together. This |
422 literature, we will often omit the $\cdot$. This |
422 is to make some concrete regular expressions more readable. |
423 is to make some concrete regular expressions more readable. |
423 For example the regular expression for email addresses shown |
424 For example the regular expression for email addresses shown |
424 in \eqref{email} would look like |
425 in \eqref{email} would fully expanded look like |
425 |
426 |
426 \[ |
427 \[ |
427 \texttt{[...]+} \;\cdot\; \texttt{@} \;\cdot\; |
428 \texttt{[...]+} \;\cdot\; \texttt{@} \;\cdot\; |
428 \texttt{[...]+} \;\cdot\; \texttt{.} \;\cdot\; |
429 \texttt{[...]+} \;\cdot\; \texttt{.} \;\cdot\; |
429 \texttt{[...]\{2,6\}} |
430 \texttt{[...]\{2,6\}} |
582 even have the vague recollection that I did not quite understand them |
584 even have the vague recollection that I did not quite understand them |
583 during my undergraduate study. If I remember correctly,\footnote{That |
585 during my undergraduate study. If I remember correctly,\footnote{That |
584 was really a long time ago.} I got utterly confused about $\ONE$ |
586 was really a long time ago.} I got utterly confused about $\ONE$ |
585 (which my lecturer wrote as $\epsilon$) and the empty string (which he |
587 (which my lecturer wrote as $\epsilon$) and the empty string (which he |
586 also wrote as $\epsilon$).\footnote{Obviously the lecturer must have |
588 also wrote as $\epsilon$).\footnote{Obviously the lecturer must have |
587 been bad ;o)} Since my then, I have used regular expressions for |
589 been bad ;o)} Since then, I have used regular expressions for |
588 implementing lexers and parsers as I have always been interested in |
590 implementing lexers and parsers as I have always been interested in |
589 all kinds of programming languages and compilers, which invariably |
591 all kinds of programming languages and compilers, which invariably |
590 need regular expressions in some form or shape. |
592 need regular expressions in some form or shape. |
591 |
593 |
592 To understand my fascination \emph{nowadays} with regular |
594 To understand my fascination \emph{nowadays} with regular |
594 for the last 17 years has been with theorem provers. I am a |
596 for the last 17 years has been with theorem provers. I am a |
595 core developer of one of |
597 core developer of one of |
596 them.\footnote{\url{http://isabelle.in.tum.de}} Theorem |
598 them.\footnote{\url{http://isabelle.in.tum.de}} Theorem |
597 provers are systems in which you can formally reason about |
599 provers are systems in which you can formally reason about |
598 mathematical concepts, but also about programs. In this way |
600 mathematical concepts, but also about programs. In this way |
599 theorem provers can help with the manacing problem of writing bug-free code. Theorem provers have |
601 theorem provers can help with the menacing problem of writing bug-free code. Theorem provers have |
600 proved already their value in a number of cases (even in |
602 proved already their value in a number of cases (even in |
601 terms of hard cash), but they are still clunky and difficult |
603 terms of hard cash), but they are still clunky and difficult |
602 to use by average programmers. |
604 to use by average programmers. |
603 |
605 |
604 Anyway, in about 2011 I came across the notion of \defn{derivatives of |
606 Anyway, in about 2011 I came across the notion of \defn{derivatives of |
605 regular expressions}. This notion allows one to do almost all |
607 regular expressions}. This notion allows one to do almost all |
606 calculations with regular expressions on the level of regular |
608 calculations with regular expressions on the level of regular |
607 expressions, not needing any automata (you will see we only touch |
609 expressions, not needing any automata (you will see we only touch |
608 briefly on automata in lecture 3). Automata are usually the main |
610 briefly on automata in lecture 3). Automata are usually the main |
609 object of study in formal language courses. The avoidance of automata |
611 object of study in formal language courses. The avoidance of automata |
610 is crucial because automata are graphs and it is rather difficult to |
612 is crucial for me because automata are graphs and it is rather difficult to |
611 reason about graphs in theorem provers. In contrast, reasoning about |
613 reason about graphs in theorem provers. In contrast, reasoning about |
612 regular expressions is easy-peasy in theorem provers. Is this |
614 regular expressions is easy-peasy in theorem provers. Is this |
613 important? I think yes, because according to Kuklewicz nearly all |
615 important? I think yes, because according to Kuklewicz nearly all |
614 POSIX-based regular expression matchers are |
616 POSIX-based regular expression matchers are |
615 buggy.\footnote{\url{http://www.haskell.org/haskellwiki/Regex_Posix}} |
617 buggy.\footnote{\url{http://www.haskell.org/haskellwiki/Regex_Posix}} |
627 expressions and the notion of derivatives. Earlier versions of |
629 expressions and the notion of derivatives. Earlier versions of |
628 this theorem used always automata in the proof. Using this |
630 this theorem used always automata in the proof. Using this |
629 theorem we can show that regular languages are closed under |
631 theorem we can show that regular languages are closed under |
630 complementation, something which Gasarch in his |
632 complementation, something which Gasarch in his |
631 blog\footnote{\url{http://goo.gl/2R11Fw}} assumed can only be |
633 blog\footnote{\url{http://goo.gl/2R11Fw}} assumed can only be |
632 shown via automata. Even sombody who has written a 700+-page |
634 shown via automata. So even somebody who has written a 700+-page |
633 book\footnote{\url{http://goo.gl/fD0eHx}} on regular |
635 book\footnote{\url{http://goo.gl/fD0eHx}} on regular |
634 exprssions did not know better. Well, we showed it can also be |
636 expressions did not know better. Well, we showed it can also be |
635 done with regular expressions only.\footnote{\url{http://nms.kcl.ac.uk/christian.urban/Publications/posix.pdf}} |
637 done with regular expressions only.\footnote{\url{http://nms.kcl.ac.uk/christian.urban/Publications/posix.pdf}} |
636 What a feeling when you are an outsider to the subject! |
638 What a feeling when you are an outsider to the subject! |
637 |
639 |
638 To conclude: Despite my early ignorance about regular expressions, I |
640 To conclude: Despite my early ignorance about regular expressions, I |
639 find them now very interesting. They have a beautiful mathematical |
641 find them now extremely interesting. They have practical importance |
640 theory behind them, which can be sometimes quite deep and which |
|
641 sometimes contains hidden snares. They have practical importance |
|
642 (remember the shocking runtime of the regular expression matchers in |
642 (remember the shocking runtime of the regular expression matchers in |
643 Python, Ruby and Java in some instances and the problems in Stack |
643 Python, Ruby and Java in some instances and the problems in Stack |
644 Exchange and the Atom editor). People who are not very familiar with |
644 Exchange and the Atom editor). They are used in tools like Snort and |
645 the mathematical background of regular expressions get them |
645 Bro in order to monitor network traffic. They have a beautiful mathematical |
|
646 theory behind them, which can be sometimes quite deep and which |
|
647 sometimes contains hidden snares. People who are not very familiar |
|
648 with the mathematical background of regular expressions get them |
646 consistently wrong (this is surprising given they are a supposed to be |
649 consistently wrong (this is surprising given they are a supposed to be |
647 core skill for computer scientists). The hope is that we can do better |
650 core skill for computer scientists). The hope is that we can do better |
648 in the future---for example by proving that the algorithms actually |
651 in the future---for example by proving that the algorithms actually |
649 satisfy their specification and that the corresponding implementations |
652 satisfy their specification and that the corresponding implementations |
650 do not contain any bugs. We are close, but not yet quite there. |
653 do not contain any bugs. We are close, but not yet quite there. |
651 |
654 |
652 Notwithstanding my fascination, I am also happy to admit that regular |
655 Notwithstanding my fascination, I am also happy to admit that regular |
653 expressions have their shortcomings. There are some well-known |
656 expressions have their shortcomings. There are some well-known |
654 ``theoretical'' shortcomings, for example recognising strings of the |
657 ``theoretical'' shortcomings, for example recognising strings of the |
655 form $a^{n}b^{n}$ is not possible with regular expressions. This means |
658 form $a^{n}b^{n}$ is not possible with regular expressions. This means |
656 for example if we try to regognise whether parentheses are well-nested |
659 for example if we try to recognise whether parentheses are well-nested |
657 in an expression is impossible with (basic) regular expressions. I am |
660 in an expression is impossible with (basic) regular expressions. I am |
658 not so bothered by these shortcomings. What I am bothered about is |
661 not so bothered by these shortcomings. What I am bothered about is |
659 when regular expressions are in the way of practical programming. For |
662 when regular expressions are in the way of practical programming. For |
660 example, it turns out that the regular expression for email addresses |
663 example, it turns out that the regular expression for email addresses |
661 shown in \eqref{email} is hopelessly inadequate for recognising all of |
664 shown in \eqref{email} is hopelessly inadequate for recognising all of |