handouts/ho01.tex
changeset 416 357c395ae838
parent 415 4ae59fd3b174
child 417 e74c696821a2
equal deleted inserted replaced
415:4ae59fd3b174 416:357c395ae838
    45 Also often we face the problem that we are given a string (for
    45 Also often we face the problem that we are given a string (for
    46 example some user input) and want to know whether it matches a
    46 example some user input) and want to know whether it matches a
    47 particular pattern---be it an email address, for example. In
    47 particular pattern---be it an email address, for example. In
    48 this way we can exclude user input that would otherwise have
    48 this way we can exclude user input that would otherwise have
    49 nasty effects on our program (crashing it or making it go into
    49 nasty effects on our program (crashing it or making it go into
    50 an infinite loop, if not worse).
    50 an infinite loop, if not worse).\smallskip
    51 
    51 
    52 \defn{Regular expressions} help with conveniently specifying
    52 \defn{Regular expressions} help with conveniently specifying
    53 such patterns. The idea behind regular expressions is that
    53 such patterns. The idea behind regular expressions is that
    54 they are a simple method for describing languages (or sets of
    54 they are a simple method for describing languages (or sets of
    55 strings)\ldots at least languages we are interested in in
    55 strings)\ldots at least languages we are interested in in
    60 ``8 Regular Expressions You Should Know''
    60 ``8 Regular Expressions You Should Know''
    61 \url{http://goo.gl/5LoVX7}} Consider the following regular
    61 \url{http://goo.gl/5LoVX7}} Consider the following regular
    62 expression
    62 expression
    63 
    63 
    64 \begin{equation}\label{email}
    64 \begin{equation}\label{email}
    65 \texttt{[a-z0-9\_.-]+ @ [a-z0-9.-]+ . [a-z.]\{2,6\}}
    65 \texttt{[a-z0-9\_.-]+} \;\;\texttt{@}\;\; \texttt{[a-z0-9.-]+} \;\;\texttt{.}\;\; \texttt{[a-z.]\{2,6\}}
    66 \end{equation}
    66 \end{equation}
    67 
    67 
    68 \noindent where the first part matches one or more lowercase
    68 \noindent where the first part, the user name, matches one or more lowercase
    69 letters (\pcode{a-z}), digits (\pcode{0-9}), underscores, dots
    69 letters (\pcode{a-z}), digits (\pcode{0-9}), underscores, dots
    70 and hyphens. The \pcode{+} at the end of the brackets ensures
    70 and hyphens. The \pcode{+} at the end of the brackets ensures
    71 the ``one or more''. Then comes the \pcode{@}-sign, followed
    71 the ``one or more''. Then comes the \pcode{@}-sign, followed
    72 by the domain name which must be one or more lowercase
    72 by the domain name which must be one or more lowercase
    73 letters, digits, underscores, dots or hyphens. Note there
    73 letters, digits, underscores, dots or hyphens. Note there
   144 \end{center}
   144 \end{center}
   145 
   145 
   146 \noindent With this table you can figure out the purpose of
   146 \noindent With this table you can figure out the purpose of
   147 the regular expressions in the web-crawlers shown Figures
   147 the regular expressions in the web-crawlers shown Figures
   148 \ref{crawler1}, \ref{crawler2} and
   148 \ref{crawler1}, \ref{crawler2} and
   149 \ref{crawler3}.\footnote{There is an interesting twist in the
   149 \ref{crawler3}. Note, however, the regular expression for
   150 web-scraper where \pcode{re*?} is used instead of
       
   151 \pcode{re*}.} Note, however, the regular expression for
       
   152 http-addresses in web-pages in Figure~\ref{crawler1}, Line 15,
   150 http-addresses in web-pages in Figure~\ref{crawler1}, Line 15,
   153 is intended to be
   151 is intended to be
   154 
   152 
   155 \[
   153 \[
   156 \pcode{"https?://[^"]*"}
   154 \pcode{"https?://[^"]*"}
   184 then is there any interest in studying them again in depth in
   182 then is there any interest in studying them again in depth in
   185 this module? Well, one answer is in the following two graphs about
   183 this module? Well, one answer is in the following two graphs about
   186 regular expression matching in Python, Ruby and Java.
   184 regular expression matching in Python, Ruby and Java.
   187 
   185 
   188 \begin{center}
   186 \begin{center}
   189 \begin{tabular}{@{}cc@{}}
   187 \begin{tabular}{@{\hspace{-1mm}}c@{\hspace{-1mm}}c@{}}
   190 \begin{tikzpicture}
   188 \begin{tikzpicture}
   191 \begin{axis}[
   189 \begin{axis}[
   192     title={Graph: $\texttt{a?\{n\}\,a{\{n\}}}$ and strings 
   190     title={Graph: $\texttt{a?\{n\}\,a{\{n\}}}$ and strings 
   193            $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
   191            $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
   194     xlabel={$n$},
   192     xlabel={$n$},
   257 \begin{center}
   255 \begin{center}
   258 \url{http://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}
   256 \url{http://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}
   259 \end{center}
   257 \end{center}
   260 
   258 
   261 \noindent
   259 \noindent
       
   260 It was also a problem in the Atom editor
       
   261 
       
   262 \begin{center}
       
   263 \url{http://davidvgalbraith.com/how-i-fixed-atom/}
       
   264 \end{center}
       
   265 
       
   266 \noindent
   262 Such troublesome regular expressions are sometimes called \emph{evil
   267 Such troublesome regular expressions are sometimes called \emph{evil
   263   regular expressions} because they have the potential to make regular
   268   regular expressions} because they have the potential to make regular
   264 expression matching engines to topple over, like in Python, Ruby and
   269 expression matching engines to topple over, like in Python, Ruby and
   265 Java. This `toopling over' is also sometimes called \emph{catastrophic
   270 Java. This ``toppling over'' is also sometimes called \emph{catastrophic
   266   backtracking}.  The problem with evil regular expressions is that
   271   backtracking}.  The problem with evil regular expressions is that
   267 they can have some serious consequences, for example, if you use them
   272 they can have some serious consequences, for example, if you use them
   268 in your web-application. The reason is that hackers can look for these
   273 in your web-application. The reason is that hackers can look for these
   269 instances where the matching engine behaves badly and then mount a
   274 instances where the matching engine behaves badly and then mount a
   270 nice DoS-attack against your application. These attacks are already
   275 nice DoS-attack against your application. These attacks are already
   271 have their own name: \emph{Regular Expression Denial of Service
   276 have their own name: \emph{Regular Expression Denial of Service
   272   Attacks (ReDoS)}.
   277   Attacks (ReDoS)}.
   273 
   278 
   274 It will be instructive to look behind the ``scenes'' to find
   279 It will be instructive to look behind the ``scenes'' to find
   275 out why Python and Ruby (and others) behave so badly when
   280 out why Python and Ruby (and others) behave so badly when
   276 matching with evil regular expressions. But we will also look
   281 matching strings with evil regular expressions. But we will also look
   277 at a relatively simple algorithm that solves this problem much
   282 at a relatively simple algorithm that solves this problem much
   278 better than Python and Ruby do\ldots actually it will be two
   283 better than Python and Ruby do\ldots actually it will be two
   279 versions of the algorithm: the first one will be able to
   284 versions of the algorithm: the first one will be able to
   280 process strings of approximately 1,000 \texttt{a}s in 30
   285 process strings of approximately 1,000 \texttt{a}s in 30
   281 seconds, while the second version will even be able to process
   286 seconds, while the second version will even be able to process
   373 
   378 
   374 \[
   379 \[
   375 A\star\dn \bigcup_{0\le n} A^n
   380 A\star\dn \bigcup_{0\le n} A^n
   376 \]
   381 \]
   377 
   382 
   378 \noindent 
       
   379 Note that this expands to
       
   380 
       
   381 \[
       
   382 A\star \dn A^0 \cup A^1 \cup A^2 \cup A^3 \cup A^4 \cup \ldots
       
   383 \]
       
   384 
       
   385 \noindent which is equivalent to
       
   386 
       
   387 \[
       
   388 A\star \dn \{[]\} \cup A \cup A@A \cup A@A@A \cup A@A@A@A \cup \ldots
       
   389 \]
       
   390 
       
   391 \noindent 
       
   392 Remember that $A^0$ is always the set containing the empty 
       
   393 string.
       
   394 
   383 
   395 We will use parentheses to disambiguate regular expressions.
   384 We will use parentheses to disambiguate regular expressions.
   396 Parentheses are not really part of a regular expression, and
   385 Parentheses are not really part of a regular expression, and
   397 indeed we do not need them in our code because there the tree
   386 indeed we do not need them in our code because there the tree
   398 structure of regular expressions is always clear. But for
   387 structure of regular expressions is always clear. But for
   590 for the last 14 years has been with theorem provers. I am a
   579 for the last 14 years has been with theorem provers. I am a
   591 core developer of one of
   580 core developer of one of
   592 them.\footnote{\url{http://isabelle.in.tum.de}} Theorem
   581 them.\footnote{\url{http://isabelle.in.tum.de}} Theorem
   593 provers are systems in which you can formally reason about
   582 provers are systems in which you can formally reason about
   594 mathematical concepts, but also about programs. In this way
   583 mathematical concepts, but also about programs. In this way
   595 they can help with writing bug-free code. Theorem provers have
   584 theorem prover can help with the manacing problem of writing bug-free code. Theorem provers have
   596 proved already their value in a number of systems (even in
   585 proved already their value in a number of cases (even in
   597 terms of hard cash), but they are still clunky and difficult
   586 terms of hard cash), but they are still clunky and difficult
   598 to use by average programmers.
   587 to use by average programmers.
   599 
   588 
   600 Anyway, in about 2011 I came across the notion of
   589 Anyway, in about 2011 I came across the notion of
   601 \defn{derivatives of regular expressions}. This notion allows
   590 \defn{derivatives of regular expressions}. This notion allows
   602 one to do almost all calculations in regular language theory
   591 one to do almost all calculations in regular language theory
   603 on the level of regular expressions, not needing any automata.
   592 on the level of regular expressions, not needing any automata (you will
   604 This is important because automata are graphs and it is rather
   593 see we only touch briefly on automata in lecture 3).
       
   594 This is crucial because automata are graphs and it is rather
   605 difficult to reason about graphs in theorem provers. In
   595 difficult to reason about graphs in theorem provers. In
   606 contrast, to reason about regular expressions is easy-peasy in
   596 contrast, reasoning about regular expressions is easy-peasy in
   607 theorem provers. Is this important? I think yes, because
   597 theorem provers. Is this important? I think yes, because
   608 according to Kuklewicz nearly all POSIX-based regular
   598 according to Kuklewicz nearly all POSIX-based regular
   609 expression matchers are
   599 expression matchers are
   610 buggy.\footnote{\url{http://www.haskell.org/haskellwiki/Regex_Posix}}
   600 buggy.\footnote{\url{http://www.haskell.org/haskellwiki/Regex_Posix}}
   611 With my PhD student Fahad Ausaf I am currently working on
   601 With my PhD student Fahad Ausaf I proved
   612 proving the correctness for one such matcher that was
   602 the correctness for one such matcher that was
   613 proposed by Sulzmann and Lu in
   603 proposed by Sulzmann and Lu in
   614 2014.\footnote{\url{http://goo.gl/bz0eHp}} This would be an
   604 2014.\footnote{\url{http://goo.gl/bz0eHp}} Hopefully we can
   615 attractive results since we will be able to prove that the
   605 prove that the machine code(!)
   616 algorithm is really correct, but also that the machine code(!)
       
   617 that implements this code efficiently is correct. Writing
   606 that implements this code efficiently is correct. Writing
   618 programs in this way does not leave any room for potential
   607 programs in this way does not leave any room for potential
   619 errors or bugs. How nice!
   608 errors or bugs. How nice!
   620 
   609 
   621 What also helped with my fascination with regular expressions
   610 What also helped with my fascination with regular expressions
   622 is that we could indeed find out new things about them that
   611 is that we could indeed find out new things about them that
   623 have surprised some experts in the field of regular
   612 have surprised some experts. Together with two colleagues from China, I was
   624 expressions. Together with two colleagues from China, I was
       
   625 able to prove the Myhill-Nerode theorem by only using regular
   613 able to prove the Myhill-Nerode theorem by only using regular
   626 expressions and the notion of derivatives. Earlier versions of
   614 expressions and the notion of derivatives. Earlier versions of
   627 this theorem used always automata in the proof. Using this
   615 this theorem used always automata in the proof. Using this
   628 theorem we can show that regular languages are closed under
   616 theorem we can show that regular languages are closed under
   629 complementation, something which Gasarch in his
   617 complementation, something which Gasarch in his
   635 What a feeling if you are an outsider to the subject!
   623 What a feeling if you are an outsider to the subject!
   636 
   624 
   637 To conclude: Despite my early ignorance about regular
   625 To conclude: Despite my early ignorance about regular
   638 expressions, I find them now very interesting. They have a
   626 expressions, I find them now very interesting. They have a
   639 beautiful mathematical theory behind them, which can be
   627 beautiful mathematical theory behind them, which can be
   640 sometimes quite deep and contain hidden snares. They have
   628 sometimes quite deep and which contains hidden snares. They have
   641 practical importance (remember the shocking runtime of the
   629 practical importance (remember the shocking runtime of the
   642 regular expression matchers in Python and Ruby in some
   630 regular expression matchers in Python, Ruby and Java in some
   643 instances). People who are not very familiar with the
   631 instances) and the problems in Stack Exchange and the Atom editor. 
       
   632 People who are not very familiar with the
   644 mathematical background of regular expressions get them
   633 mathematical background of regular expressions get them
   645 consistently wrong. The hope is that we can do better in the
   634 consistently wrong. The hope is that we can do better in the
   646 future---for example by proving that the algorithms actually
   635 future---for example by proving that the algorithms actually
   647 satisfy their specification and that the corresponding
   636 satisfy their specification and that the corresponding
   648 implementations do not contain any bugs. We are close, but not
   637 implementations do not contain any bugs. We are close, but not
   670 the syntax of email addresses. With their proposed regular
   659 the syntax of email addresses. With their proposed regular
   671 expression they are too strict in some cases and too lax in
   660 expression they are too strict in some cases and too lax in
   672 others. Not a good situation to be in. A regular expression
   661 others. Not a good situation to be in. A regular expression
   673 that is claimed to be closer to the standard is shown in
   662 that is claimed to be closer to the standard is shown in
   674 Figure~\ref{monster}. Whether this claim is true or not, I
   663 Figure~\ref{monster}. Whether this claim is true or not, I
   675 would not know---the only thing I can say to this regular
   664 would not know---the only thing I can say about this regular
   676 expression is it is a monstrosity. However, this might
   665 expression is it is a monstrosity. However, this might
   677 actually be an argument against the RFC standard, rather than
   666 actually be an argument against the RFC standard, rather than
   678 against regular expressions. A similar argument is made in
   667 against regular expressions. A similar argument is made in
   679 
   668 
   680 \begin{center}
   669 \begin{center}
   733 \lstinputlisting[language={},keywordstyle=\color{black},numbers=none]{../progs/email-rexp}
   722 \lstinputlisting[language={},keywordstyle=\color{black},numbers=none]{../progs/email-rexp}
   734 \end{minipage}
   723 \end{minipage}
   735 \end{center}
   724 \end{center}
   736 
   725 
   737 \caption{Nothing that can be said about this regular
   726 \caption{Nothing that can be said about this regular
   738 expression\ldots\label{monster}}
   727 expression\ldots{}except it is a monstrosity.\label{monster}}
   739 \end{figure}
   728 \end{figure}
   740 
   729 
   741 
   730 
   742 \end{document}
   731 \end{document}
   743 
   732