afl-material: comparison handouts/ho01.tex

equal deleted inserted replaced

-:558f911f7802
+:cf287db8dc15
+% !TEX program = xelatex
 \documentclass{article}
 \usepackage{../style}
 \usepackage{../langs}
 \usepackage{../graphics}
 \usepackage{../data}
 \begin{document}
 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017, 2018, 2019}
 \section*{Handout 1}
-This module is about text processing, be it for web-crawlers,
+The first part of this module is about text processing, be it for
-compilers, dictionaries, DNA-data, ad filters and so on.  When looking
+web-crawlers, compilers, dictionaries, DNA-data, ad filters and so on.
-for a particular string, like \pcode{"foobar"} in a large text we can
+When looking for a particular string, say \pcode{"foobar"}, in a large
-use the Knuth-Morris-Pratt algorithm, which is currently the most
+text we can use the Knuth-Morris-Pratt algorithm, which is currently the
-efficient general string search algorithm. But often we do \emph{not}
+most efficient general string search algorithm. But often we do
-just look for a particular string, but for string patterns. For
+\emph{not} just look for a particular string, but for string patterns.
-example, in program code we need to identify what are the keywords
+For example, in program code we need to identify what are the keywords
-(\texttt{if}, \texttt{then}, \texttt{while}, \texttt{for}, etc), what
+(\texttt{if}, \texttt{then}, \texttt{while}, \texttt{for}, etc) and what
-are the identifiers (variable names). A pattern for identifiers could
+are the identifiers (variable names). A pattern for identifiers could be
-be stated as: they start with a letter, followed by zero or more
+stated as: they start with a letter, followed by zero or more letters,
-letters, numbers and underscores. You might also be surprised what
+numbers and underscores.
-constraints programming languages impose about numbers: for example
-123 in JSON is OK, but 0123 is not.
+%You might also be surprised what
+%constraints programming languages impose about numbers: for example
-Often we also face the problem that we are given a string (for example
+%123 in JSON is OK, but 0123 is not.
-some user input) and want to know whether it matches a particular
+Often we also face the problem that we are given a string, for example
+some user input, and we want to know whether it matches a particular
 pattern---be it an email address, for example. In this way we can
 exclude user input that would otherwise have nasty effects on our
 program (crashing it or making it go into an infinite loop, if not
-worse). In tools like Snort, scanning for computer viruses or
+worse). This kind of ``inspecting'' mechanism is implemented in popular
+network security tools such as Snort and
+Bro.\footnote{\url{www.snort.org}, \url{www.bro.org}} They scan incoming
+network traffic for computer viruses or malicious packets. Also
 filtering out spam usually involves scanning for some signature
-(essentially a string pattern).  The point is that the fast
+(essentially a string pattern). The point is that the fast
 Knuth-Morris-Pratt algorithm for strings is not good enough for such
 string \emph{patterns}.\smallskip
 \defn{Regular expressions} help with conveniently specifying
 such patterns. The idea behind regular expressions is that
 \begin{equation}\label{email}
 \texttt{[a-z0-9\_.-]+} \;\;\texttt{@}\;\; \texttt{[a-z0-9.-]+} \;\;\texttt{.}\;\; \texttt{[a-z.]\{2,6\}}
 \end{equation}
-\noindent where the first part, the user name, matches one or more lowercase
+\noindent where the first part, the user name, matches one or more
-letters (\pcode{a-z}), digits (\pcode{0-9}), underscores, dots
+lowercase letters (\pcode{a-z}), digits (\pcode{0-9}), underscores, dots
-and hyphens. The \pcode{+} at the end of the brackets ensures
+and hyphens. The \pcode{+} at the end of the brackets ensures the ``one
-the ``one or more''. Then comes the email \pcode{@}-sign, followed
+or more''. Then comes the email \pcode{@}-sign, followed by the domain
-by the domain name which must be one or more lowercase
+name which must be one or more lowercase letters, digits, underscores,
-letters, digits, underscores, dots or hyphens. Note there
+dots or hyphens (but no underscores).  Finally there must be a dot
-cannot be an underscore in the domain name. Finally there must
+followed by the toplevel domain. This toplevel domain must be 2 to 6
-be a dot followed by the toplevel domain. This toplevel domain
+lowercase letters including the dot. Example strings which follow this
-must be 2 to 6 lowercase letters including the dot. Example
+pattern are:
-strings which follow this pattern are:
 \begin{lstlisting}[language={},numbers=none,keywordstyle=\color{black}]
 niceandsimple@example.org
 very.common@example.co.uk
 a.little.lengthy.but.fine@dept.example.ac.uk
 \[
 \pcode{""""https?://[^"]*"""".r}
 \]
-\noindent Note also that the convention in Scala is that
+\noindent The convention in Scala is that \texttt{.r} converts a string
-\texttt{.r} converts a string into a regular expression. I
+into a regular expression. I leave it to you to ponder whether this
-leave it to you to ponder whether this regular expression
+regular expression really captures all possible web-addresses. If you
-really captures all possible web-addresses. If you need a quick
+need a quick recap about regular expressions and how the match strings,
-recap about regular expressions and how the match strings,
+here is a quick video: \url{https://youtu.be/bgBWp9EIlMM}.
-here is a quick video
-\url{https://youtu.be/bgBWp9EIlMM}.
 \subsection*{Why Study Regular Expressions?}
 Regular expressions were introduced by Kleene in the 1950ies and they
 have been object of intense study since then. They are nowadays pretty
 much ubiquitous in computer science. There are many libraries
 implementing regular expressions. I am sure you have come across them
-before (remember the PRA module?). Why on earth then is there any
+before (remember the PRA module?).
-interest in studying them again in depth in this module? Well, one
-answer is in the following two graphs about regular expression
+Why on earth then is there any interest in studying them again in depth
-matching in Python, Ruby, JavaScript and Java (Version 8).
+in this module? Well, one answer is in the following two graphs about
+regular expression matching in Python, Ruby, JavaScript and Java
-\begin{center}
+(Version 8).
-\begin{tabular}{@{\hspace{-1mm}}c@{\hspace{1mm}}c@{}}
+\begin{center}
+\begin{tabular}{@{\hspace{-1mm}}c@{\hspace{1.5mm}}c@{}}
 \begin{tikzpicture}
 \begin{axis}[
 title={Graph: $\texttt{(a*)*\,b}$ and strings
 $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
 xlabel={$n$},
 \end{tabular}
 \end{center}
 \noindent This first graph shows that Python, JavaScript and Java 8 need
 approximately 30 seconds to find out that the regular expression
-$\texttt{(a*)*\,b}$ does not match strings of 28 \texttt{a}s.
+$\texttt{(a*)*\,b}$ does not match strings of 28 \texttt{a}s. Similarly,
-Similarly, the second shows that Python needs approximately 29 seconds
+the second shows that Python and Ruby need approximately 29 seconds for finding
-for finding out whether a string of 28 \texttt{a}s matches the regular
+out whether a string of 28 \texttt{a}s matches the regular expression
-expression \texttt{a?\{28\}\,a\{28\}}.  Ruby is even slightly
+\texttt{a?\{28\}\,a\{28\}}.\footnote{In this
-worse.\footnote{In this example Ruby uses the slightly different
+example Ruby uses actually the slightly different regular expression
-regular expression \texttt{a?a?a?...a?a?aaa...aa}, where the
+\texttt{a?a?a?...a?a?aaa...aa}, where the \texttt{a?} and \texttt{a}
-\texttt{a?} and \texttt{a} each occur $n$ times. More such test
+each occur $n$ times. More such test cases can be found at
-cases can be found at \url{https://www.owasp.org/index.php/Regular_expression_Denial_of_Service_-_ReDoS}.}
+\url{https://www.owasp.org/index.php/Regular_expression_Denial_of_Service_-_ReDoS}.}
 Admittedly, these regular expressions are carefully chosen to exhibit
 this exponential behaviour, but similar ones occur more often than one
 wants in ``real life''. For example, on 20 July 2016 a similar regular
 expression brought the webpage \href{http://stackexchange.com}{Stack
 Exchange} to its knees:
 \begin{center}
 \url{http://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}
 \end{center}
 \begin{center}
 \url{http://davidvgalbraith.com/how-i-fixed-atom/}
 \end{center}
 \noindent
-and when somebody tried to match web-addresses using regular
+and also when somebody tried to match web-addresses using a
-expressions
+relatively simple regular expression
 \begin{center}
 \url{https://www.tutorialdocs.com/article/regex-trap.html}
 \end{center}
+\noindent
+Finally, on 2 July 2019 Cloudflare had a global outage because of a
+regular expression (they had no outage for the last 6 years).  What
+happened is nicely explained in the blog:
+\begin{center}
+\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019}
+\end{center}
 Such troublesome regular expressions are sometimes called \emph{evil
 regular expressions} because they have the potential to make regular
-expression matching engines to topple over, like in Python, Ruby and
+expression matching engines to topple over, like in Python, Ruby,
-Java 8. This ``toppling over'' is also sometimes called
+JavaScript and Java 8. This ``toppling over'' is also sometimes called
 \emph{catastrophic backtracking}.  I have also seen the term
 \emph{eternal matching} used for this.  The problem with evil regular
 expressions is that they can have some serious consequences, for
 example, if you use them in your web-application. The reason is that
 hackers can look for these instances where the matching engine behaves
 badly and then mount a nice DoS-attack against your application. These
 attacks are already have their own name: \emph{Regular Expression
 Denial of Service Attacks (ReDoS)}.
 It will be instructive to look behind the ``scenes'' to find
 out why Python and Ruby (and others) behave so badly when
 matching strings with evil regular expressions. But we will also look

changeset 621	cf287db8dc15
parent 618	f4818c95a32e
child 622	b47e140bcccd