45 Also often we face the problem that we are given a string (for |
45 Also often we face the problem that we are given a string (for |
46 example some user input) and want to know whether it matches a |
46 example some user input) and want to know whether it matches a |
47 particular pattern---be it an email address, for example. In |
47 particular pattern---be it an email address, for example. In |
48 this way we can exclude user input that would otherwise have |
48 this way we can exclude user input that would otherwise have |
49 nasty effects on our program (crashing it or making it go into |
49 nasty effects on our program (crashing it or making it go into |
50 an infinite loop, if not worse). |
50 an infinite loop, if not worse).\smallskip |
51 |
51 |
52 \defn{Regular expressions} help with conveniently specifying |
52 \defn{Regular expressions} help with conveniently specifying |
53 such patterns. The idea behind regular expressions is that |
53 such patterns. The idea behind regular expressions is that |
54 they are a simple method for describing languages (or sets of |
54 they are a simple method for describing languages (or sets of |
55 strings)\ldots at least languages we are interested in in |
55 strings)\ldots at least languages we are interested in in |
60 ``8 Regular Expressions You Should Know'' |
60 ``8 Regular Expressions You Should Know'' |
61 \url{http://goo.gl/5LoVX7}} Consider the following regular |
61 \url{http://goo.gl/5LoVX7}} Consider the following regular |
62 expression |
62 expression |
63 |
63 |
64 \begin{equation}\label{email} |
64 \begin{equation}\label{email} |
65 \texttt{[a-z0-9\_.-]+ @ [a-z0-9.-]+ . [a-z.]\{2,6\}} |
65 \texttt{[a-z0-9\_.-]+} \;\;\texttt{@}\;\; \texttt{[a-z0-9.-]+} \;\;\texttt{.}\;\; \texttt{[a-z.]\{2,6\}} |
66 \end{equation} |
66 \end{equation} |
67 |
67 |
68 \noindent where the first part matches one or more lowercase |
68 \noindent where the first part, the user name, matches one or more lowercase |
69 letters (\pcode{a-z}), digits (\pcode{0-9}), underscores, dots |
69 letters (\pcode{a-z}), digits (\pcode{0-9}), underscores, dots |
70 and hyphens. The \pcode{+} at the end of the brackets ensures |
70 and hyphens. The \pcode{+} at the end of the brackets ensures |
71 the ``one or more''. Then comes the \pcode{@}-sign, followed |
71 the ``one or more''. Then comes the \pcode{@}-sign, followed |
72 by the domain name which must be one or more lowercase |
72 by the domain name which must be one or more lowercase |
73 letters, digits, underscores, dots or hyphens. Note there |
73 letters, digits, underscores, dots or hyphens. Note there |
144 \end{center} |
144 \end{center} |
145 |
145 |
146 \noindent With this table you can figure out the purpose of |
146 \noindent With this table you can figure out the purpose of |
147 the regular expressions in the web-crawlers shown Figures |
147 the regular expressions in the web-crawlers shown Figures |
148 \ref{crawler1}, \ref{crawler2} and |
148 \ref{crawler1}, \ref{crawler2} and |
149 \ref{crawler3}.\footnote{There is an interesting twist in the |
149 \ref{crawler3}. Note, however, the regular expression for |
150 web-scraper where \pcode{re*?} is used instead of |
|
151 \pcode{re*}.} Note, however, the regular expression for |
|
152 http-addresses in web-pages in Figure~\ref{crawler1}, Line 15, |
150 http-addresses in web-pages in Figure~\ref{crawler1}, Line 15, |
153 is intended to be |
151 is intended to be |
154 |
152 |
155 \[ |
153 \[ |
156 \pcode{"https?://[^"]*"} |
154 \pcode{"https?://[^"]*"} |
257 \begin{center} |
255 \begin{center} |
258 \url{http://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016} |
256 \url{http://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016} |
259 \end{center} |
257 \end{center} |
260 |
258 |
261 \noindent |
259 \noindent |
|
260 It was also a problem in the Atom editor |
|
261 |
|
262 \begin{center} |
|
263 \url{http://davidvgalbraith.com/how-i-fixed-atom/} |
|
264 \end{center} |
|
265 |
|
266 \noindent |
262 Such troublesome regular expressions are sometimes called \emph{evil |
267 Such troublesome regular expressions are sometimes called \emph{evil |
263 regular expressions} because they have the potential to make regular |
268 regular expressions} because they have the potential to make regular |
264 expression matching engines to topple over, like in Python, Ruby and |
269 expression matching engines to topple over, like in Python, Ruby and |
265 Java. This `toopling over' is also sometimes called \emph{catastrophic |
270 Java. This ``toppling over'' is also sometimes called \emph{catastrophic |
266 backtracking}. The problem with evil regular expressions is that |
271 backtracking}. The problem with evil regular expressions is that |
267 they can have some serious consequences, for example, if you use them |
272 they can have some serious consequences, for example, if you use them |
268 in your web-application. The reason is that hackers can look for these |
273 in your web-application. The reason is that hackers can look for these |
269 instances where the matching engine behaves badly and then mount a |
274 instances where the matching engine behaves badly and then mount a |
270 nice DoS-attack against your application. These attacks are already |
275 nice DoS-attack against your application. These attacks are already |
271 have their own name: \emph{Regular Expression Denial of Service |
276 have their own name: \emph{Regular Expression Denial of Service |
272 Attacks (ReDoS)}. |
277 Attacks (ReDoS)}. |
273 |
278 |
274 It will be instructive to look behind the ``scenes'' to find |
279 It will be instructive to look behind the ``scenes'' to find |
275 out why Python and Ruby (and others) behave so badly when |
280 out why Python and Ruby (and others) behave so badly when |
276 matching with evil regular expressions. But we will also look |
281 matching strings with evil regular expressions. But we will also look |
277 at a relatively simple algorithm that solves this problem much |
282 at a relatively simple algorithm that solves this problem much |
278 better than Python and Ruby do\ldots actually it will be two |
283 better than Python and Ruby do\ldots actually it will be two |
279 versions of the algorithm: the first one will be able to |
284 versions of the algorithm: the first one will be able to |
280 process strings of approximately 1,000 \texttt{a}s in 30 |
285 process strings of approximately 1,000 \texttt{a}s in 30 |
281 seconds, while the second version will even be able to process |
286 seconds, while the second version will even be able to process |
590 for the last 14 years has been with theorem provers. I am a |
579 for the last 14 years has been with theorem provers. I am a |
591 core developer of one of |
580 core developer of one of |
592 them.\footnote{\url{http://isabelle.in.tum.de}} Theorem |
581 them.\footnote{\url{http://isabelle.in.tum.de}} Theorem |
593 provers are systems in which you can formally reason about |
582 provers are systems in which you can formally reason about |
594 mathematical concepts, but also about programs. In this way |
583 mathematical concepts, but also about programs. In this way |
595 they can help with writing bug-free code. Theorem provers have |
584 theorem prover can help with the manacing problem of writing bug-free code. Theorem provers have |
596 proved already their value in a number of systems (even in |
585 proved already their value in a number of cases (even in |
597 terms of hard cash), but they are still clunky and difficult |
586 terms of hard cash), but they are still clunky and difficult |
598 to use by average programmers. |
587 to use by average programmers. |
599 |
588 |
600 Anyway, in about 2011 I came across the notion of |
589 Anyway, in about 2011 I came across the notion of |
601 \defn{derivatives of regular expressions}. This notion allows |
590 \defn{derivatives of regular expressions}. This notion allows |
602 one to do almost all calculations in regular language theory |
591 one to do almost all calculations in regular language theory |
603 on the level of regular expressions, not needing any automata. |
592 on the level of regular expressions, not needing any automata (you will |
604 This is important because automata are graphs and it is rather |
593 see we only touch briefly on automata in lecture 3). |
|
594 This is crucial because automata are graphs and it is rather |
605 difficult to reason about graphs in theorem provers. In |
595 difficult to reason about graphs in theorem provers. In |
606 contrast, to reason about regular expressions is easy-peasy in |
596 contrast, reasoning about regular expressions is easy-peasy in |
607 theorem provers. Is this important? I think yes, because |
597 theorem provers. Is this important? I think yes, because |
608 according to Kuklewicz nearly all POSIX-based regular |
598 according to Kuklewicz nearly all POSIX-based regular |
609 expression matchers are |
599 expression matchers are |
610 buggy.\footnote{\url{http://www.haskell.org/haskellwiki/Regex_Posix}} |
600 buggy.\footnote{\url{http://www.haskell.org/haskellwiki/Regex_Posix}} |
611 With my PhD student Fahad Ausaf I am currently working on |
601 With my PhD student Fahad Ausaf I proved |
612 proving the correctness for one such matcher that was |
602 the correctness for one such matcher that was |
613 proposed by Sulzmann and Lu in |
603 proposed by Sulzmann and Lu in |
614 2014.\footnote{\url{http://goo.gl/bz0eHp}} This would be an |
604 2014.\footnote{\url{http://goo.gl/bz0eHp}} Hopefully we can |
615 attractive results since we will be able to prove that the |
605 prove that the machine code(!) |
616 algorithm is really correct, but also that the machine code(!) |
|
617 that implements this code efficiently is correct. Writing |
606 that implements this code efficiently is correct. Writing |
618 programs in this way does not leave any room for potential |
607 programs in this way does not leave any room for potential |
619 errors or bugs. How nice! |
608 errors or bugs. How nice! |
620 |
609 |
621 What also helped with my fascination with regular expressions |
610 What also helped with my fascination with regular expressions |
622 is that we could indeed find out new things about them that |
611 is that we could indeed find out new things about them that |
623 have surprised some experts in the field of regular |
612 have surprised some experts. Together with two colleagues from China, I was |
624 expressions. Together with two colleagues from China, I was |
|
625 able to prove the Myhill-Nerode theorem by only using regular |
613 able to prove the Myhill-Nerode theorem by only using regular |
626 expressions and the notion of derivatives. Earlier versions of |
614 expressions and the notion of derivatives. Earlier versions of |
627 this theorem used always automata in the proof. Using this |
615 this theorem used always automata in the proof. Using this |
628 theorem we can show that regular languages are closed under |
616 theorem we can show that regular languages are closed under |
629 complementation, something which Gasarch in his |
617 complementation, something which Gasarch in his |
635 What a feeling if you are an outsider to the subject! |
623 What a feeling if you are an outsider to the subject! |
636 |
624 |
637 To conclude: Despite my early ignorance about regular |
625 To conclude: Despite my early ignorance about regular |
638 expressions, I find them now very interesting. They have a |
626 expressions, I find them now very interesting. They have a |
639 beautiful mathematical theory behind them, which can be |
627 beautiful mathematical theory behind them, which can be |
640 sometimes quite deep and contain hidden snares. They have |
628 sometimes quite deep and which contains hidden snares. They have |
641 practical importance (remember the shocking runtime of the |
629 practical importance (remember the shocking runtime of the |
642 regular expression matchers in Python and Ruby in some |
630 regular expression matchers in Python, Ruby and Java in some |
643 instances). People who are not very familiar with the |
631 instances) and the problems in Stack Exchange and the Atom editor. |
|
632 People who are not very familiar with the |
644 mathematical background of regular expressions get them |
633 mathematical background of regular expressions get them |
645 consistently wrong. The hope is that we can do better in the |
634 consistently wrong. The hope is that we can do better in the |
646 future---for example by proving that the algorithms actually |
635 future---for example by proving that the algorithms actually |
647 satisfy their specification and that the corresponding |
636 satisfy their specification and that the corresponding |
648 implementations do not contain any bugs. We are close, but not |
637 implementations do not contain any bugs. We are close, but not |
670 the syntax of email addresses. With their proposed regular |
659 the syntax of email addresses. With their proposed regular |
671 expression they are too strict in some cases and too lax in |
660 expression they are too strict in some cases and too lax in |
672 others. Not a good situation to be in. A regular expression |
661 others. Not a good situation to be in. A regular expression |
673 that is claimed to be closer to the standard is shown in |
662 that is claimed to be closer to the standard is shown in |
674 Figure~\ref{monster}. Whether this claim is true or not, I |
663 Figure~\ref{monster}. Whether this claim is true or not, I |
675 would not know---the only thing I can say to this regular |
664 would not know---the only thing I can say about this regular |
676 expression is it is a monstrosity. However, this might |
665 expression is it is a monstrosity. However, this might |
677 actually be an argument against the RFC standard, rather than |
666 actually be an argument against the RFC standard, rather than |
678 against regular expressions. A similar argument is made in |
667 against regular expressions. A similar argument is made in |
679 |
668 |
680 \begin{center} |
669 \begin{center} |