handouts/ho01.tex
changeset 416 357c395ae838
parent 415 4ae59fd3b174
child 417 e74c696821a2
--- a/handouts/ho01.tex	Tue Aug 23 12:26:13 2016 +0200
+++ b/handouts/ho01.tex	Tue Aug 23 15:38:20 2016 +0200
@@ -47,7 +47,7 @@
 particular pattern---be it an email address, for example. In
 this way we can exclude user input that would otherwise have
 nasty effects on our program (crashing it or making it go into
-an infinite loop, if not worse).
+an infinite loop, if not worse).\smallskip
 
 \defn{Regular expressions} help with conveniently specifying
 such patterns. The idea behind regular expressions is that
@@ -62,10 +62,10 @@
 expression
 
 \begin{equation}\label{email}
-\texttt{[a-z0-9\_.-]+ @ [a-z0-9.-]+ . [a-z.]\{2,6\}}
+\texttt{[a-z0-9\_.-]+} \;\;\texttt{@}\;\; \texttt{[a-z0-9.-]+} \;\;\texttt{.}\;\; \texttt{[a-z.]\{2,6\}}
 \end{equation}
 
-\noindent where the first part matches one or more lowercase
+\noindent where the first part, the user name, matches one or more lowercase
 letters (\pcode{a-z}), digits (\pcode{0-9}), underscores, dots
 and hyphens. The \pcode{+} at the end of the brackets ensures
 the ``one or more''. Then comes the \pcode{@}-sign, followed
@@ -146,9 +146,7 @@
 \noindent With this table you can figure out the purpose of
 the regular expressions in the web-crawlers shown Figures
 \ref{crawler1}, \ref{crawler2} and
-\ref{crawler3}.\footnote{There is an interesting twist in the
-web-scraper where \pcode{re*?} is used instead of
-\pcode{re*}.} Note, however, the regular expression for
+\ref{crawler3}. Note, however, the regular expression for
 http-addresses in web-pages in Figure~\ref{crawler1}, Line 15,
 is intended to be
 
@@ -186,7 +184,7 @@
 regular expression matching in Python, Ruby and Java.
 
 \begin{center}
-\begin{tabular}{@{}cc@{}}
+\begin{tabular}{@{\hspace{-1mm}}c@{\hspace{-1mm}}c@{}}
 \begin{tikzpicture}
 \begin{axis}[
     title={Graph: $\texttt{a?\{n\}\,a{\{n\}}}$ and strings 
@@ -259,10 +257,17 @@
 \end{center}
 
 \noindent
+It was also a problem in the Atom editor
+
+\begin{center}
+\url{http://davidvgalbraith.com/how-i-fixed-atom/}
+\end{center}
+
+\noindent
 Such troublesome regular expressions are sometimes called \emph{evil
   regular expressions} because they have the potential to make regular
 expression matching engines to topple over, like in Python, Ruby and
-Java. This `toopling over' is also sometimes called \emph{catastrophic
+Java. This ``toppling over'' is also sometimes called \emph{catastrophic
   backtracking}.  The problem with evil regular expressions is that
 they can have some serious consequences, for example, if you use them
 in your web-application. The reason is that hackers can look for these
@@ -273,7 +278,7 @@
 
 It will be instructive to look behind the ``scenes'' to find
 out why Python and Ruby (and others) behave so badly when
-matching with evil regular expressions. But we will also look
+matching strings with evil regular expressions. But we will also look
 at a relatively simple algorithm that solves this problem much
 better than Python and Ruby do\ldots actually it will be two
 versions of the algorithm: the first one will be able to
@@ -375,22 +380,6 @@
 A\star\dn \bigcup_{0\le n} A^n
 \]
 
-\noindent 
-Note that this expands to
-
-\[
-A\star \dn A^0 \cup A^1 \cup A^2 \cup A^3 \cup A^4 \cup \ldots
-\]
-
-\noindent which is equivalent to
-
-\[
-A\star \dn \{[]\} \cup A \cup A@A \cup A@A@A \cup A@A@A@A \cup \ldots
-\]
-
-\noindent 
-Remember that $A^0$ is always the set containing the empty 
-string.
 
 We will use parentheses to disambiguate regular expressions.
 Parentheses are not really part of a regular expression, and
@@ -592,36 +581,35 @@
 them.\footnote{\url{http://isabelle.in.tum.de}} Theorem
 provers are systems in which you can formally reason about
 mathematical concepts, but also about programs. In this way
-they can help with writing bug-free code. Theorem provers have
-proved already their value in a number of systems (even in
+theorem prover can help with the manacing problem of writing bug-free code. Theorem provers have
+proved already their value in a number of cases (even in
 terms of hard cash), but they are still clunky and difficult
 to use by average programmers.
 
 Anyway, in about 2011 I came across the notion of
 \defn{derivatives of regular expressions}. This notion allows
 one to do almost all calculations in regular language theory
-on the level of regular expressions, not needing any automata.
-This is important because automata are graphs and it is rather
+on the level of regular expressions, not needing any automata (you will
+see we only touch briefly on automata in lecture 3).
+This is crucial because automata are graphs and it is rather
 difficult to reason about graphs in theorem provers. In
-contrast, to reason about regular expressions is easy-peasy in
+contrast, reasoning about regular expressions is easy-peasy in
 theorem provers. Is this important? I think yes, because
 according to Kuklewicz nearly all POSIX-based regular
 expression matchers are
 buggy.\footnote{\url{http://www.haskell.org/haskellwiki/Regex_Posix}}
-With my PhD student Fahad Ausaf I am currently working on
-proving the correctness for one such matcher that was
+With my PhD student Fahad Ausaf I proved
+the correctness for one such matcher that was
 proposed by Sulzmann and Lu in
-2014.\footnote{\url{http://goo.gl/bz0eHp}} This would be an
-attractive results since we will be able to prove that the
-algorithm is really correct, but also that the machine code(!)
+2014.\footnote{\url{http://goo.gl/bz0eHp}} Hopefully we can
+prove that the machine code(!)
 that implements this code efficiently is correct. Writing
 programs in this way does not leave any room for potential
 errors or bugs. How nice!
 
 What also helped with my fascination with regular expressions
 is that we could indeed find out new things about them that
-have surprised some experts in the field of regular
-expressions. Together with two colleagues from China, I was
+have surprised some experts. Together with two colleagues from China, I was
 able to prove the Myhill-Nerode theorem by only using regular
 expressions and the notion of derivatives. Earlier versions of
 this theorem used always automata in the proof. Using this
@@ -637,10 +625,11 @@
 To conclude: Despite my early ignorance about regular
 expressions, I find them now very interesting. They have a
 beautiful mathematical theory behind them, which can be
-sometimes quite deep and contain hidden snares. They have
+sometimes quite deep and which contains hidden snares. They have
 practical importance (remember the shocking runtime of the
-regular expression matchers in Python and Ruby in some
-instances). People who are not very familiar with the
+regular expression matchers in Python, Ruby and Java in some
+instances) and the problems in Stack Exchange and the Atom editor. 
+People who are not very familiar with the
 mathematical background of regular expressions get them
 consistently wrong. The hope is that we can do better in the
 future---for example by proving that the algorithms actually
@@ -672,7 +661,7 @@
 others. Not a good situation to be in. A regular expression
 that is claimed to be closer to the standard is shown in
 Figure~\ref{monster}. Whether this claim is true or not, I
-would not know---the only thing I can say to this regular
+would not know---the only thing I can say about this regular
 expression is it is a monstrosity. However, this might
 actually be an argument against the RFC standard, rather than
 against regular expressions. A similar argument is made in
@@ -735,7 +724,7 @@
 \end{center}
 
 \caption{Nothing that can be said about this regular
-expression\ldots\label{monster}}
+expression\ldots{}except it is a monstrosity.\label{monster}}
 \end{figure}