--- a/handouts/ho01.tex Tue Aug 23 12:26:13 2016 +0200
+++ b/handouts/ho01.tex Tue Aug 23 15:38:20 2016 +0200
@@ -47,7 +47,7 @@
particular pattern---be it an email address, for example. In
this way we can exclude user input that would otherwise have
nasty effects on our program (crashing it or making it go into
-an infinite loop, if not worse).
+an infinite loop, if not worse).\smallskip
\defn{Regular expressions} help with conveniently specifying
such patterns. The idea behind regular expressions is that
@@ -62,10 +62,10 @@
expression
\begin{equation}\label{email}
-\texttt{[a-z0-9\_.-]+ @ [a-z0-9.-]+ . [a-z.]\{2,6\}}
+\texttt{[a-z0-9\_.-]+} \;\;\texttt{@}\;\; \texttt{[a-z0-9.-]+} \;\;\texttt{.}\;\; \texttt{[a-z.]\{2,6\}}
\end{equation}
-\noindent where the first part matches one or more lowercase
+\noindent where the first part, the user name, matches one or more lowercase
letters (\pcode{a-z}), digits (\pcode{0-9}), underscores, dots
and hyphens. The \pcode{+} at the end of the brackets ensures
the ``one or more''. Then comes the \pcode{@}-sign, followed
@@ -146,9 +146,7 @@
\noindent With this table you can figure out the purpose of
the regular expressions in the web-crawlers shown Figures
\ref{crawler1}, \ref{crawler2} and
-\ref{crawler3}.\footnote{There is an interesting twist in the
-web-scraper where \pcode{re*?} is used instead of
-\pcode{re*}.} Note, however, the regular expression for
+\ref{crawler3}. Note, however, the regular expression for
http-addresses in web-pages in Figure~\ref{crawler1}, Line 15,
is intended to be
@@ -186,7 +184,7 @@
regular expression matching in Python, Ruby and Java.
\begin{center}
-\begin{tabular}{@{}cc@{}}
+\begin{tabular}{@{\hspace{-1mm}}c@{\hspace{-1mm}}c@{}}
\begin{tikzpicture}
\begin{axis}[
title={Graph: $\texttt{a?\{n\}\,a{\{n\}}}$ and strings
@@ -259,10 +257,17 @@
\end{center}
\noindent
+It was also a problem in the Atom editor
+
+\begin{center}
+\url{http://davidvgalbraith.com/how-i-fixed-atom/}
+\end{center}
+
+\noindent
Such troublesome regular expressions are sometimes called \emph{evil
regular expressions} because they have the potential to make regular
expression matching engines to topple over, like in Python, Ruby and
-Java. This `toopling over' is also sometimes called \emph{catastrophic
+Java. This ``toppling over'' is also sometimes called \emph{catastrophic
backtracking}. The problem with evil regular expressions is that
they can have some serious consequences, for example, if you use them
in your web-application. The reason is that hackers can look for these
@@ -273,7 +278,7 @@
It will be instructive to look behind the ``scenes'' to find
out why Python and Ruby (and others) behave so badly when
-matching with evil regular expressions. But we will also look
+matching strings with evil regular expressions. But we will also look
at a relatively simple algorithm that solves this problem much
better than Python and Ruby do\ldots actually it will be two
versions of the algorithm: the first one will be able to
@@ -375,22 +380,6 @@
A\star\dn \bigcup_{0\le n} A^n
\]
-\noindent
-Note that this expands to
-
-\[
-A\star \dn A^0 \cup A^1 \cup A^2 \cup A^3 \cup A^4 \cup \ldots
-\]
-
-\noindent which is equivalent to
-
-\[
-A\star \dn \{[]\} \cup A \cup A@A \cup A@A@A \cup A@A@A@A \cup \ldots
-\]
-
-\noindent
-Remember that $A^0$ is always the set containing the empty
-string.
We will use parentheses to disambiguate regular expressions.
Parentheses are not really part of a regular expression, and
@@ -592,36 +581,35 @@
them.\footnote{\url{http://isabelle.in.tum.de}} Theorem
provers are systems in which you can formally reason about
mathematical concepts, but also about programs. In this way
-they can help with writing bug-free code. Theorem provers have
-proved already their value in a number of systems (even in
+theorem prover can help with the manacing problem of writing bug-free code. Theorem provers have
+proved already their value in a number of cases (even in
terms of hard cash), but they are still clunky and difficult
to use by average programmers.
Anyway, in about 2011 I came across the notion of
\defn{derivatives of regular expressions}. This notion allows
one to do almost all calculations in regular language theory
-on the level of regular expressions, not needing any automata.
-This is important because automata are graphs and it is rather
+on the level of regular expressions, not needing any automata (you will
+see we only touch briefly on automata in lecture 3).
+This is crucial because automata are graphs and it is rather
difficult to reason about graphs in theorem provers. In
-contrast, to reason about regular expressions is easy-peasy in
+contrast, reasoning about regular expressions is easy-peasy in
theorem provers. Is this important? I think yes, because
according to Kuklewicz nearly all POSIX-based regular
expression matchers are
buggy.\footnote{\url{http://www.haskell.org/haskellwiki/Regex_Posix}}
-With my PhD student Fahad Ausaf I am currently working on
-proving the correctness for one such matcher that was
+With my PhD student Fahad Ausaf I proved
+the correctness for one such matcher that was
proposed by Sulzmann and Lu in
-2014.\footnote{\url{http://goo.gl/bz0eHp}} This would be an
-attractive results since we will be able to prove that the
-algorithm is really correct, but also that the machine code(!)
+2014.\footnote{\url{http://goo.gl/bz0eHp}} Hopefully we can
+prove that the machine code(!)
that implements this code efficiently is correct. Writing
programs in this way does not leave any room for potential
errors or bugs. How nice!
What also helped with my fascination with regular expressions
is that we could indeed find out new things about them that
-have surprised some experts in the field of regular
-expressions. Together with two colleagues from China, I was
+have surprised some experts. Together with two colleagues from China, I was
able to prove the Myhill-Nerode theorem by only using regular
expressions and the notion of derivatives. Earlier versions of
this theorem used always automata in the proof. Using this
@@ -637,10 +625,11 @@
To conclude: Despite my early ignorance about regular
expressions, I find them now very interesting. They have a
beautiful mathematical theory behind them, which can be
-sometimes quite deep and contain hidden snares. They have
+sometimes quite deep and which contains hidden snares. They have
practical importance (remember the shocking runtime of the
-regular expression matchers in Python and Ruby in some
-instances). People who are not very familiar with the
+regular expression matchers in Python, Ruby and Java in some
+instances) and the problems in Stack Exchange and the Atom editor.
+People who are not very familiar with the
mathematical background of regular expressions get them
consistently wrong. The hope is that we can do better in the
future---for example by proving that the algorithms actually
@@ -672,7 +661,7 @@
others. Not a good situation to be in. A regular expression
that is claimed to be closer to the standard is shown in
Figure~\ref{monster}. Whether this claim is true or not, I
-would not know---the only thing I can say to this regular
+would not know---the only thing I can say about this regular
expression is it is a monstrosity. However, this might
actually be an argument against the RFC standard, rather than
against regular expressions. A similar argument is made in
@@ -735,7 +724,7 @@
\end{center}
\caption{Nothing that can be said about this regular
-expression\ldots\label{monster}}
+expression\ldots{}except it is a monstrosity.\label{monster}}
\end{figure}