--- a/handouts/ho01.tex Thu Apr 18 14:16:31 2019 +0100
+++ b/handouts/ho01.tex Wed Jul 17 11:17:52 2019 +0100
@@ -1,3 +1,4 @@
+% !TEX program = xelatex
\documentclass{article}
\usepackage{../style}
\usepackage{../langs}
@@ -40,28 +41,33 @@
\section*{Handout 1}
-This module is about text processing, be it for web-crawlers,
-compilers, dictionaries, DNA-data, ad filters and so on. When looking
-for a particular string, like \pcode{"foobar"} in a large text we can
-use the Knuth-Morris-Pratt algorithm, which is currently the most
-efficient general string search algorithm. But often we do \emph{not}
-just look for a particular string, but for string patterns. For
-example, in program code we need to identify what are the keywords
-(\texttt{if}, \texttt{then}, \texttt{while}, \texttt{for}, etc), what
-are the identifiers (variable names). A pattern for identifiers could
-be stated as: they start with a letter, followed by zero or more
-letters, numbers and underscores. You might also be surprised what
-constraints programming languages impose about numbers: for example
-123 in JSON is OK, but 0123 is not.
+The first part of this module is about text processing, be it for
+web-crawlers, compilers, dictionaries, DNA-data, ad filters and so on.
+When looking for a particular string, say \pcode{"foobar"}, in a large
+text we can use the Knuth-Morris-Pratt algorithm, which is currently the
+most efficient general string search algorithm. But often we do
+\emph{not} just look for a particular string, but for string patterns.
+For example, in program code we need to identify what are the keywords
+(\texttt{if}, \texttt{then}, \texttt{while}, \texttt{for}, etc.) and what
+are the identifiers (variable names). A pattern for identifiers could be
+stated as: they start with a letter, followed by zero or more letters,
+numbers and underscores.
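This identifier pattern can be rendered directly as a regular expression. A minimal sketch (assuming Scala 2.13+ for \texttt{Regex.matches}; the concrete character classes are my own rendering of the stated rule):

```scala
// Sketch: "start with a letter, followed by zero or more letters,
// numbers and underscores" rendered as a regular expression.
object IdentifierDemo {
  def main(args: Array[String]): Unit = {
    val ident = "[a-zA-Z][a-zA-Z0-9_]*".r   // .r: String => Regex
    println(ident.matches("foo_bar2"))      // true
    println(ident.matches("2foo"))          // false: starts with a digit
  }
}
```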
-Often we also face the problem that we are given a string (for example
-some user input) and want to know whether it matches a particular
+%You might also be surprised what
+%constraints programming languages impose about numbers: for example
+%123 in JSON is OK, but 0123 is not.
+
+Often we also face the problem that we are given a string, for example
+some user input, and we want to know whether it matches a particular
pattern---be it an email address, for example. In this way we can
exclude user input that would otherwise have nasty effects on our
program (crashing it or making it go into an infinite loop, if not
-worse). In tools like Snort, scanning for computer viruses or
+worse). This kind of ``inspecting'' mechanism is implemented in popular
+network security tools such as Snort and
+Bro.\footnote{\url{www.snort.org}, \url{www.bro.org}} They scan incoming
+network traffic for computer viruses or malicious packets. Also
filtering out spam usually involves scanning for some signature
-(essentially a string pattern). The point is that the fast
+(essentially a string pattern). The point is that the fast
Knuth-Morris-Pratt algorithm for strings is not good enough for such
string \emph{patterns}.\smallskip
@@ -81,16 +87,15 @@
\texttt{[a-z0-9\_.-]+} \;\;\texttt{@}\;\; \texttt{[a-z0-9.-]+} \;\;\texttt{.}\;\; \texttt{[a-z.]\{2,6\}}
\end{equation}
-\noindent where the first part, the user name, matches one or more lowercase
-letters (\pcode{a-z}), digits (\pcode{0-9}), underscores, dots
-and hyphens. The \pcode{+} at the end of the brackets ensures
-the ``one or more''. Then comes the email \pcode{@}-sign, followed
-by the domain name which must be one or more lowercase
-letters, digits, underscores, dots or hyphens. Note there
-cannot be an underscore in the domain name. Finally there must
-be a dot followed by the toplevel domain. This toplevel domain
-must be 2 to 6 lowercase letters including the dot. Example
-strings which follow this pattern are:
+\noindent where the first part, the user name, matches one or more
+lowercase letters (\pcode{a-z}), digits (\pcode{0-9}), underscores, dots
+and hyphens. The \pcode{+} at the end of the brackets ensures the ``one
+or more''. Then comes the email \pcode{@}-sign, followed by the domain
+name which must be one or more lowercase letters, digits, dots or
+hyphens (but no underscores). Finally there must be a dot
+followed by the toplevel domain. This toplevel domain must consist of
+2 to 6 lowercase letters or dots. Example strings which follow this
+pattern are:
\begin{lstlisting}[language={},numbers=none,keywordstyle=\color{black}]
niceandsimple@example.org
@@ -182,13 +187,11 @@
\pcode{""""https?://[^"]*"""".r}
\]
-\noindent Note also that the convention in Scala is that
-\texttt{.r} converts a string into a regular expression. I
-leave it to you to ponder whether this regular expression
-really captures all possible web-addresses. If you need a quick
-recap about regular expressions and how the match strings,
-here is a quick video
-\url{https://youtu.be/bgBWp9EIlMM}.
+\noindent The convention in Scala is that \texttt{.r} converts a string
+into a regular expression. I leave it to you to ponder whether this
+regular expression really captures all possible web-addresses. If you
+need a quick recap about regular expressions and how they match strings,
+here is a quick video: \url{https://youtu.be/bgBWp9EIlMM}.
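To make the \texttt{.r} convention concrete, here is a small hedged sketch (assuming Scala 2.13+ for \texttt{Regex.matches}; the pattern is a simplified stand-in for the web-address regular expression above, not the handout's exact one):

```scala
// Sketch: .r converts a plain string into a scala.util.matching.Regex.
object UrlRegexDemo {
  def main(args: Array[String]): Unit = {
    val url = """https?://[^"]*""".r             // triple quotes avoid escaping
    println(url.matches("http://example.org"))   // true
    println(url.matches("ftp://example.org"))    // false: wrong scheme
  }
}
```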
\subsection*{Why Study Regular Expressions?}
@@ -196,13 +199,15 @@
have been object of intense study since then. They are nowadays pretty
much ubiquitous in computer science. There are many libraries
implementing regular expressions. I am sure you have come across them
-before (remember the PRA module?). Why on earth then is there any
-interest in studying them again in depth in this module? Well, one
-answer is in the following two graphs about regular expression
-matching in Python, Ruby, JavaScript and Java (Version 8).
+before (remember the PRA module?).
+
+Why on earth then is there any interest in studying them again in depth
+in this module? Well, one answer is in the following two graphs about
+regular expression matching in Python, Ruby, JavaScript and Java
+(Version 8).
\begin{center}
-\begin{tabular}{@{\hspace{-1mm}}c@{\hspace{1mm}}c@{}}
+\begin{tabular}{@{\hspace{-1mm}}c@{\hspace{1.5mm}}c@{}}
\begin{tikzpicture}
\begin{axis}[
title={Graph: $\texttt{(a*)*\,b}$ and strings
@@ -256,19 +261,19 @@
\noindent This first graph shows that Python, JavaScript and Java 8 need
approximately 30 seconds to find out that the regular expression
-$\texttt{(a*)*\,b}$ does not match strings of 28 \texttt{a}s.
-Similarly, the second shows that Python needs approximately 29 seconds
-for finding out whether a string of 28 \texttt{a}s matches the regular
-expression \texttt{a?\{28\}\,a\{28\}}. Ruby is even slightly
-worse.\footnote{In this example Ruby uses the slightly different
- regular expression \texttt{a?a?a?...a?a?aaa...aa}, where the
- \texttt{a?} and \texttt{a} each occur $n$ times. More such test
- cases can be found at \url{https://www.owasp.org/index.php/Regular_expression_Denial_of_Service_-_ReDoS}.}
+$\texttt{(a*)*\,b}$ does not match strings of 28 \texttt{a}s. Similarly,
+the second shows that Python and Ruby need approximately 29 seconds to find
+out whether a string of 28 \texttt{a}s matches the regular expression
+\texttt{a?\{28\}\,a\{28\}}.\footnote{In this
+example Ruby actually uses the slightly different regular expression
+\texttt{a?a?a?...a?a?aaa...aa}, where the \texttt{a?} and \texttt{a}
+each occur $n$ times. More such test cases can be found at
+\url{https://www.owasp.org/index.php/Regular_expression_Denial_of_Service_-_ReDoS}.}
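This exponential behaviour can also be reproduced on the JVM, since Scala's \texttt{Regex} delegates to Java's backtracking \texttt{java.util.regex} engine. A hedged sketch (timings vary by machine and JVM version; assuming Scala 2.13+):

```scala
// Sketch: matching the evil regular expression (a*)*b against strings
// consisting of a's only. There is no 'b' to be found, so a backtracking
// engine explores exponentially many ways of splitting up the a's.
object EvilRegexDemo {
  def main(args: Array[String]): Unit = {
    val evil = "(a*)*b".r
    for (n <- 5 to 20 by 5) {
      val s = "a" * n
      val start = System.nanoTime()
      val res = evil.matches(s)            // always false
      val ms = (System.nanoTime() - start) / 1e6
      println(f"n = $n%2d: matched = $res, time = $ms%9.3f ms")
    }
  }
}
```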
Admittedly, these regular expressions are carefully chosen to exhibit
this exponential behaviour, but similar ones occur more often than one
wants in ``real life''. For example, on 20 July 2016 a similar regular
expression brought the webpage \href{http://stackexchange.com}{Stack
- Exchange} to its knees:
+Exchange} to its knees:
\begin{center}
\url{http://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}
@@ -289,25 +294,33 @@
\end{center}
\noindent
-and when somebody tried to match web-addresses using regular
-expressions
+and also when somebody tried to match web-addresses using a
+relatively simple regular expression
\begin{center}
\url{https://www.tutorialdocs.com/article/regex-trap.html}
\end{center}
+\noindent
+Finally, on 2 July 2019 Cloudflare had a global outage because of a
+regular expression (their first global outage in six years). What
+happened is nicely explained in their blog post:
+
+\begin{center}
+\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019}
+\end{center}
Such troublesome regular expressions are sometimes called \emph{evil
regular expressions} because they have the potential to make regular
-expression matching engines to topple over, like in Python, Ruby and
-Java 8. This ``toppling over'' is also sometimes called
-\emph{catastrophic backtracking}. I have also seen the term
-\emph{eternal matching} used for this. The problem with evil regular
-expressions is that they can have some serious consequences, for
-example, if you use them in your web-application. The reason is that
-hackers can look for these instances where the matching engine behaves
-badly and then mount a nice DoS-attack against your application. These
-attacks are already have their own name: \emph{Regular Expression
+expression matching engines topple over, like in Python, Ruby,
+JavaScript and Java 8. This ``toppling over'' is also sometimes called
+\emph{catastrophic backtracking}. I have also seen the term
+\emph{eternal matching} used for this. The problem with evil regular
+expressions is that they can have some serious consequences, for
+example, if you use them in your web-application. The reason is that
+hackers can look for these instances where the matching engine behaves
+badly and then mount a nice DoS-attack against your application. These
+attacks already have their own name: \emph{Regular Expression
Denial of Service Attacks (ReDoS)}.
It will be instructive to look behind the ``scenes'' to find