diff -r 4ae59fd3b174 -r 357c395ae838 handouts/ho01.tex
--- a/handouts/ho01.tex	Tue Aug 23 12:26:13 2016 +0200
+++ b/handouts/ho01.tex	Tue Aug 23 15:38:20 2016 +0200
@@ -47,7 +47,7 @@
 particular pattern---be it an email address, for example. In
 this way we can exclude user input that would otherwise have
 nasty effects on our program (crashing it or making it go into
-an infinite loop, if not worse).
+an infinite loop, if not worse).\smallskip
 
 \defn{Regular expressions} help with conveniently specifying
 such patterns. The idea behind regular expressions is that
@@ -62,10 +62,10 @@
 expression
 
 \begin{equation}\label{email}
-\texttt{[a-z0-9\_.-]+ @ [a-z0-9.-]+ . [a-z.]\{2,6\}}
+\texttt{[a-z0-9\_.-]+} \;\;\texttt{@}\;\; \texttt{[a-z0-9.-]+} \;\;\texttt{.}\;\; \texttt{[a-z.]\{2,6\}}
 \end{equation}
 
-\noindent where the first part matches one or more lowercase
+\noindent where the first part, the user name, matches one or more lowercase
 letters (\pcode{a-z}), digits (\pcode{0-9}), underscores, dots
 and hyphens. The \pcode{+} at the end of the brackets ensures
 the ``one or more''. Then comes the \pcode{@}-sign, followed
@@ -146,9 +146,7 @@
 \noindent With this table you can figure out the purpose of
 the regular expressions in the web-crawlers shown in Figures
 \ref{crawler1}, \ref{crawler2} and
-\ref{crawler3}.\footnote{There is an interesting twist in the
-web-scraper where \pcode{re*?} is used instead of
-\pcode{re*}.} Note, however, that the regular expression for
+\ref{crawler3}. Note, however, that the regular expression for
 http-addresses in web-pages in Figure~\ref{crawler1}, Line 15,
 is intended to be
@@ -186,7 +184,7 @@
 regular expression matching in Python, Ruby and Java.
 
 \begin{center}
-\begin{tabular}{@{}cc@{}}
+\begin{tabular}{@{\hspace{-1mm}}c@{\hspace{-1mm}}c@{}}
 \begin{tikzpicture}
 \begin{axis}[
     title={Graph: $\texttt{a?\{n\}\,a{\{n\}}}$ and strings
@@ -259,10 +257,17 @@
 \end{center}
 
 \noindent
+This was also a problem in the Atom editor
+
+\begin{center}
+\url{http://davidvgalbraith.com/how-i-fixed-atom/}
+\end{center}
+
+\noindent
 Such troublesome regular expressions are sometimes called
 \emph{evil regular expressions} because they have the
 potential to make regular expression matching engines
 topple over, like in Python, Ruby and
-Java. This `toopling over' is also sometimes called \emph{catastrophic
+Java. This ``toppling over'' is also sometimes called \emph{catastrophic
 backtracking}. The problem with evil regular expressions is
 that they can have some serious consequences, for example, if
 you use them in your web-application. The reason is that
 hackers can look for these
@@ -273,7 +278,7 @@
 It will be instructive to look behind the ``scenes'' to find
 out why Python and Ruby (and others) behave so badly when
-matching with evil regular expressions. But we will also look
+matching strings with evil regular expressions. But we will also look
 at a relatively simple algorithm that solves this problem much
 better than Python and Ruby do\ldots actually it will be two
 versions of the algorithm: the first one will be able to
@@ -375,22 +380,6 @@
 A\star\dn \bigcup_{0\le n} A^n
 \]
 
-\noindent
-Note that this expands to
-
-\[
-A\star \dn A^0 \cup A^1 \cup A^2 \cup A^3 \cup A^4 \cup \ldots
-\]
-
-\noindent which is equivalent to
-
-\[
-A\star \dn \{[]\} \cup A \cup A@A \cup A@A@A \cup A@A@A@A \cup \ldots
-\]
-
-\noindent
-Remember that $A^0$ is always the set containing the empty
-string.
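+
+\noindent
+To see a concrete instance of this definition, take $A =
+\{a, b\}$ (writing $a$ and $b$ for the corresponding
+one-character strings): then $A^2 = A \,@\, A = \{aa, ab,
+ba, bb\}$, while $A\star$ contains the empty string $[]$ and
+every string built from $a$s and $b$s.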
 
 We will use parentheses to disambiguate regular expressions.
 Parentheses are not really part of a regular expression, and
@@ -592,36 +581,35 @@
 them.\footnote{\url{http://isabelle.in.tum.de}} Theorem
 provers are systems in which you can formally reason about
 mathematical concepts, but also about programs. In this way
-they can help with writing bug-free code. Theorem provers have
-proved already their value in a number of systems (even in
+theorem provers can help with the menacing problem of
+writing bug-free code. Theorem provers have
+already proved their value in a number of cases (even in
 terms of hard cash), but they are still clunky and difficult
 to use by average programmers.
 
 Anyway, in about 2011 I came across the notion of
 \defn{derivatives of regular expressions}. This notion allows
 one to do almost all calculations in regular language theory
-on the level of regular expressions, not needing any automata.
-This is important because automata are graphs and it is rather
+on the level of regular expressions, not needing any automata
+(you will see that we only briefly touch on automata in
+lecture 3).
+This is crucial because automata are graphs and it is rather
 difficult to reason about graphs in theorem provers. In
-contrast, to reason about regular expressions is easy-peasy in
+contrast, reasoning about regular expressions is easy-peasy in
 theorem provers.
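+
+To give a first taste of this approach, below is a sketch of
+how the first, unoptimised version of such a matcher (still
+without the simplification step that will make it fast) can
+be rendered in Scala; the constructor names are chosen here
+only for illustration:
+
+\begin{verbatim}
+abstract class Rexp
+case object ZERO extends Rexp                    // matches nothing
+case object ONE extends Rexp                     // matches the empty string
+case class CHAR(c: Char) extends Rexp            // matches the character c
+case class ALT(r1: Rexp, r2: Rexp) extends Rexp  // alternative
+case class SEQ(r1: Rexp, r2: Rexp) extends Rexp  // sequence
+case class STAR(r: Rexp) extends Rexp            // star
+
+// can the regular expression match the empty string?
+def nullable(r: Rexp): Boolean = r match {
+  case ZERO => false
+  case ONE => true
+  case CHAR(_) => false
+  case ALT(r1, r2) => nullable(r1) || nullable(r2)
+  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
+  case STAR(_) => true
+}
+
+// the derivative of a regular expression w.r.t. a character
+def der(c: Char, r: Rexp): Rexp = r match {
+  case ZERO => ZERO
+  case ONE => ZERO
+  case CHAR(d) => if (c == d) ONE else ZERO
+  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
+  case SEQ(r1, r2) =>
+    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
+    else SEQ(der(c, r1), r2)
+  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
+}
+
+// r matches s iff the derivative of r w.r.t. all characters
+// of s can match the empty string
+def matches(r: Rexp, s: String): Boolean =
+  nullable(s.foldLeft(r)((r1, c) => der(c, r1)))
+\end{verbatim}
+
+\noindent
+For example, \texttt{matches(STAR(CHAR('a')), "aaa")}
+evaluates to \texttt{true}.
+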
 Is this important? I think yes, because according to
 Kuklewicz nearly all POSIX-based regular expression matchers
 are
 buggy.\footnote{\url{http://www.haskell.org/haskellwiki/Regex_Posix}}
-With my PhD student Fahad Ausaf I am currently working on
-proving the correctness for one such matcher that was
+With my PhD student Fahad Ausaf I proved
+the correctness of one such matcher that was
 proposed by Sulzmann and Lu in
-2014.\footnote{\url{http://goo.gl/bz0eHp}} This would be an
-attractive results since we will be able to prove that the
-algorithm is really correct, but also that the machine code(!)
+2014.\footnote{\url{http://goo.gl/bz0eHp}} Hopefully we can
+prove that the machine code(!)
 that implements this algorithm efficiently is correct. Writing
 programs in this way does not leave any room for potential
 errors or bugs. How nice!
 
 What also helped with my fascination with regular
 expressions is that we could indeed find out new things about
 them that
-have surprised some experts in the field of regular
-expressions. Together with two colleagues from China, I was
+have surprised some experts. Together with two colleagues
+from China, I was
 able to prove the Myhill-Nerode theorem by only using regular
 expressions and the notion of derivatives. Earlier versions
 of this theorem always used automata in the proof. Using this
@@ -637,10 +625,11 @@
 To conclude: Despite my early ignorance about regular
 expressions, I find them now very interesting. They have a
 beautiful mathematical theory behind them, which can be
-sometimes quite deep and contain hidden snares. They have
+sometimes quite deep and which contains hidden snares. They have
 practical importance (remember the shocking runtime of the
-regular expression matchers in Python and Ruby in some
-instances). People who are not very familiar with the
+regular expression matchers in Python, Ruby and Java in some
+instances, and the problems at Stack Exchange and in the
+Atom editor).
+People who are not very familiar with the
 mathematical background of regular expressions get them
 consistently wrong. The hope is that we can do better in the
 future---for example by proving that the algorithms actually
@@ -672,7 +661,7 @@
 others.
 
 Not a good situation to be in. A regular expression that is
 claimed to be closer to the standard is shown in
 Figure~\ref{monster}. Whether this claim is true or not, I
-would not know---the only thing I can say to this regular
+would not know---the only thing I can say about this regular
 expression is that it is a monstrosity. However, this might
 actually be an argument against the RFC standard, rather than
 against regular expressions. A similar argument is made in
@@ -735,7 +724,7 @@
 \end{center}
 
 \caption{Nothing that can be said about this regular
-expression\ldots\label{monster}}
+expression\ldots{}except it is a monstrosity.\label{monster}}
 \end{figure}
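+
+The catastrophic backtracking from the timing graphs above
+can also be reproduced first-hand. The following small
+experiment (only a sketch, written in Scala, whose
+\texttt{matches} delegates to Java's backtracking regex
+engine) pits the evil expression $\texttt{a?\{n\}\,a\{n\}}$,
+spelled out as a plain pattern string, against strings
+consisting of $n$ \texttt{a}s:
+
+\begin{verbatim}
+// measure the running time of some code (in seconds)
+def time(code: => Boolean): Double = {
+  val start = System.nanoTime()
+  code
+  (System.nanoTime() - start) / 1.0e9
+}
+
+// the pattern a?...a? a...a (n copies of each) matched
+// against the string a...a (n copies); expect the runtimes
+// to explode well before n reaches 40
+for (n <- 20 to 30) {
+  val secs = time(("a" * n).matches("a?" * n + "a" * n))
+  println(f"n = $n%2d: $secs%.2f secs")
+}
+\end{verbatim}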