handouts/ho01.tex
changeset 621 cf287db8dc15
parent 618 f4818c95a32e
child 622 b47e140bcccd
equal deleted inserted replaced
620:558f911f7802 621:cf287db8dc15
       
     1 % !TEX program = xelatex
     1 \documentclass{article}
     2 \documentclass{article}
     2 \usepackage{../style}
     3 \usepackage{../style}
     3 \usepackage{../langs}
     4 \usepackage{../langs}
     4 \usepackage{../graphics}
     5 \usepackage{../graphics}
     5 \usepackage{../data}
     6 \usepackage{../data}
    38 \begin{document}
    39 \begin{document}
    39 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017, 2018, 2019}
    40 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017, 2018, 2019}
    40 
    41 
    41 \section*{Handout 1}
    42 \section*{Handout 1}
    42 
    43 
    43 This module is about text processing, be it for web-crawlers,
    44 The first part of this module is about text processing, be it for
    44 compilers, dictionaries, DNA-data, ad filters and so on.  When looking
    45 web-crawlers, compilers, dictionaries, DNA-data, ad filters and so on.
    45 for a particular string, like \pcode{"foobar"} in a large text we can
    46 When looking for a particular string, say \pcode{"foobar"}, in a large
    46 use the Knuth-Morris-Pratt algorithm, which is currently the most
    47 text we can use the Knuth-Morris-Pratt algorithm, which is currently the
    47 efficient general string search algorithm. But often we do \emph{not}
    48 most efficient general string search algorithm. But often we do
    48 just look for a particular string, but for string patterns. For
    49 \emph{not} just look for a particular string, but for string patterns.
    49 example, in program code we need to identify what are the keywords
    50 For example, in program code we need to identify what are the keywords
    50 (\texttt{if}, \texttt{then}, \texttt{while}, \texttt{for}, etc), what
    51 (\texttt{if}, \texttt{then}, \texttt{while}, \texttt{for}, etc) and what
    51 are the identifiers (variable names). A pattern for identifiers could
    52 are the identifiers (variable names). A pattern for identifiers could be
    52 be stated as: they start with a letter, followed by zero or more
    53 stated as: they start with a letter, followed by zero or more letters,
    53 letters, numbers and underscores. You might also be surprised what
    54 numbers and underscores. 
    54 constraints programming languages impose about numbers: for example
    55 
    55 123 in JSON is OK, but 0123 is not.
    56 %You might also be surprised what
    56 
    57 %constraints programming languages impose about numbers: for example
    57 Often we also face the problem that we are given a string (for example
    58 %123 in JSON is OK, but 0123 is not.
    58 some user input) and want to know whether it matches a particular
    59 
       
    60 Often we also face the problem that we are given a string, for example
       
    61 some user input, and we want to know whether it matches a particular
    59 pattern---be it an email address, for example. In this way we can
    62 pattern---be it an email address, for example. In this way we can
    60 exclude user input that would otherwise have nasty effects on our
    63 exclude user input that would otherwise have nasty effects on our
    61 program (crashing it or making it go into an infinite loop, if not
    64 program (crashing it or making it go into an infinite loop, if not
    62 worse). In tools like Snort, scanning for computer viruses or
    65 worse). This kind of ``inspecting'' mechanism is implemented in popular
       
    66 network security tools such as Snort and
       
    67 Bro.\footnote{\url{www.snort.org}, \url{www.bro.org}} They scan incoming
       
    68 network traffic for computer viruses or malicious packets. Also
    63 filtering out spam usually involves scanning for some signature
    69 filtering out spam usually involves scanning for some signature
    64 (essentially a string pattern).  The point is that the fast
    70 (essentially a string pattern). The point is that the fast
    65 Knuth-Morris-Pratt algorithm for strings is not good enough for such
    71 Knuth-Morris-Pratt algorithm for strings is not good enough for such
    66 string \emph{patterns}.\smallskip
    72 string \emph{patterns}.\smallskip
    67 
    73 
    68 \defn{Regular expressions} help with conveniently specifying
    74 \defn{Regular expressions} help with conveniently specifying
    69 such patterns. The idea behind regular expressions is that
    75 such patterns. The idea behind regular expressions is that
    79 
    85 
    80 \begin{equation}\label{email}
    86 \begin{equation}\label{email}
    81 \texttt{[a-z0-9\_.-]+} \;\;\texttt{@}\;\; \texttt{[a-z0-9.-]+} \;\;\texttt{.}\;\; \texttt{[a-z.]\{2,6\}}
    87 \texttt{[a-z0-9\_.-]+} \;\;\texttt{@}\;\; \texttt{[a-z0-9.-]+} \;\;\texttt{.}\;\; \texttt{[a-z.]\{2,6\}}
    82 \end{equation}
    88 \end{equation}
    83 
    89 
    84 \noindent where the first part, the user name, matches one or more lowercase
    90 \noindent where the first part, the user name, matches one or more
    85 letters (\pcode{a-z}), digits (\pcode{0-9}), underscores, dots
    91 lowercase letters (\pcode{a-z}), digits (\pcode{0-9}), underscores, dots
    86 and hyphens. The \pcode{+} at the end of the brackets ensures
    92 and hyphens. The \pcode{+} at the end of the brackets ensures the ``one
    87 the ``one or more''. Then comes the email \pcode{@}-sign, followed
    93 or more''. Then comes the email \pcode{@}-sign, followed by the domain
    88 by the domain name which must be one or more lowercase
    94 name which must be one or more lowercase letters, digits, underscores,
    89 letters, digits, underscores, dots or hyphens. Note there
    95 dots or hyphens (but no underscores).  Finally there must be a dot
    90 cannot be an underscore in the domain name. Finally there must
    96 followed by the toplevel domain. This toplevel domain must be 2 to 6
    91 be a dot followed by the toplevel domain. This toplevel domain
    97 lowercase letters including the dot. Example strings which follow this
    92 must be 2 to 6 lowercase letters including the dot. Example
    98 pattern are:
    93 strings which follow this pattern are:
       
    94 
    99 
    95 \begin{lstlisting}[language={},numbers=none,keywordstyle=\color{black}]
   100 \begin{lstlisting}[language={},numbers=none,keywordstyle=\color{black}]
    96 niceandsimple@example.org
   101 niceandsimple@example.org
    97 very.common@example.co.uk
   102 very.common@example.co.uk
    98 a.little.lengthy.but.fine@dept.example.ac.uk
   103 a.little.lengthy.but.fine@dept.example.ac.uk
   180 
   185 
   181 \[
   186 \[
   182 \pcode{""""https?://[^"]*"""".r}
   187 \pcode{""""https?://[^"]*"""".r}
   183 \]
   188 \]
   184 
   189 
   185 \noindent Note also that the convention in Scala is that
   190 \noindent The convention in Scala is that \texttt{.r} converts a string
   186 \texttt{.r} converts a string into a regular expression. I
   191 into a regular expression. I leave it to you to ponder whether this
   187 leave it to you to ponder whether this regular expression
   192 regular expression really captures all possible web-addresses. If you
   188 really captures all possible web-addresses. If you need a quick
   193 need a quick recap about regular expressions and how the match strings,
   189 recap about regular expressions and how the match strings, 
   194 here is a quick video: \url{https://youtu.be/bgBWp9EIlMM}.
   190 here is a quick video
       
   191 \url{https://youtu.be/bgBWp9EIlMM}.
       
   192 
   195 
   193 \subsection*{Why Study Regular Expressions?}
   196 \subsection*{Why Study Regular Expressions?}
   194 
   197 
   195 Regular expressions were introduced by Kleene in the 1950ies and they
   198 Regular expressions were introduced by Kleene in the 1950ies and they
   196 have been object of intense study since then. They are nowadays pretty
   199 have been object of intense study since then. They are nowadays pretty
   197 much ubiquitous in computer science. There are many libraries
   200 much ubiquitous in computer science. There are many libraries
   198 implementing regular expressions. I am sure you have come across them
   201 implementing regular expressions. I am sure you have come across them
   199 before (remember the PRA module?). Why on earth then is there any
   202 before (remember the PRA module?). 
   200 interest in studying them again in depth in this module? Well, one
   203 
   201 answer is in the following two graphs about regular expression
   204 Why on earth then is there any interest in studying them again in depth
   202 matching in Python, Ruby, JavaScript and Java (Version 8).
   205 in this module? Well, one answer is in the following two graphs about
   203 
   206 regular expression matching in Python, Ruby, JavaScript and Java
   204 \begin{center}
   207 (Version 8).
   205 \begin{tabular}{@{\hspace{-1mm}}c@{\hspace{1mm}}c@{}}
   208 
       
   209 \begin{center}
       
   210 \begin{tabular}{@{\hspace{-1mm}}c@{\hspace{1.5mm}}c@{}}
   206 \begin{tikzpicture}
   211 \begin{tikzpicture}
   207 \begin{axis}[
   212 \begin{axis}[
   208     title={Graph: $\texttt{(a*)*\,b}$ and strings 
   213     title={Graph: $\texttt{(a*)*\,b}$ and strings 
   209            $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
   214            $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
   210     xlabel={$n$},
   215     xlabel={$n$},
   254 \end{tabular}
   259 \end{tabular}
   255 \end{center}
   260 \end{center}
   256 
   261 
   257 \noindent This first graph shows that Python, JavaScript and Java 8 need
   262 \noindent This first graph shows that Python, JavaScript and Java 8 need
   258 approximately 30 seconds to find out that the regular expression
   263 approximately 30 seconds to find out that the regular expression
   259 $\texttt{(a*)*\,b}$ does not match strings of 28 \texttt{a}s.
   264 $\texttt{(a*)*\,b}$ does not match strings of 28 \texttt{a}s. Similarly,
   260 Similarly, the second shows that Python needs approximately 29 seconds
   265 the second shows that Python and Ruby need approximately 29 seconds for finding
   261 for finding out whether a string of 28 \texttt{a}s matches the regular
   266 out whether a string of 28 \texttt{a}s matches the regular expression
   262 expression \texttt{a?\{28\}\,a\{28\}}.  Ruby is even slightly
   267 \texttt{a?\{28\}\,a\{28\}}.\footnote{In this
   263 worse.\footnote{In this example Ruby uses the slightly different
   268 example Ruby uses actually the slightly different regular expression
   264   regular expression \texttt{a?a?a?...a?a?aaa...aa}, where the
   269 \texttt{a?a?a?...a?a?aaa...aa}, where the \texttt{a?} and \texttt{a}
   265   \texttt{a?} and \texttt{a} each occur $n$ times. More such test
   270 each occur $n$ times. More such test cases can be found at
   266   cases can be found at \url{https://www.owasp.org/index.php/Regular_expression_Denial_of_Service_-_ReDoS}.}
   271 \url{https://www.owasp.org/index.php/Regular_expression_Denial_of_Service_-_ReDoS}.}
   267 Admittedly, these regular expressions are carefully chosen to exhibit
   272 Admittedly, these regular expressions are carefully chosen to exhibit
   268 this exponential behaviour, but similar ones occur more often than one
   273 this exponential behaviour, but similar ones occur more often than one
   269 wants in ``real life''. For example, on 20 July 2016 a similar regular
   274 wants in ``real life''. For example, on 20 July 2016 a similar regular
   270 expression brought the webpage \href{http://stackexchange.com}{Stack
   275 expression brought the webpage \href{http://stackexchange.com}{Stack
   271   Exchange} to its knees:
   276 Exchange} to its knees:
   272 
   277 
   273 \begin{center}
   278 \begin{center}
   274 \url{http://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}
   279 \url{http://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}
   275 \end{center}
   280 \end{center}
   276 
   281 
   287 \begin{center}
   292 \begin{center}
   288 \url{http://davidvgalbraith.com/how-i-fixed-atom/}
   293 \url{http://davidvgalbraith.com/how-i-fixed-atom/}
   289 \end{center}
   294 \end{center}
   290 
   295 
   291 \noindent
   296 \noindent
   292 and when somebody tried to match web-addresses using regular
   297 and also when somebody tried to match web-addresses using a 
   293 expressions
   298 relatively simple regular expression
   294 
   299 
   295 \begin{center}
   300 \begin{center}
   296 \url{https://www.tutorialdocs.com/article/regex-trap.html}
   301 \url{https://www.tutorialdocs.com/article/regex-trap.html}
   297 \end{center}  
   302 \end{center}  
   298 
   303 
       
   304 \noindent
       
   305 Finally, on 2 July 2019 Cloudflare had a global outage because of a 
       
   306 regular expression (they had no outage for the last 6 years).  What
       
   307 happened is nicely explained in the blog:
       
   308 
       
   309 \begin{center}
       
   310 \url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019}
       
   311 \end{center}  
   299 
   312 
   300 Such troublesome regular expressions are sometimes called \emph{evil
   313 Such troublesome regular expressions are sometimes called \emph{evil
   301   regular expressions} because they have the potential to make regular
   314   regular expressions} because they have the potential to make regular
   302 expression matching engines to topple over, like in Python, Ruby and
   315   expression matching engines to topple over, like in Python, Ruby,
   303 Java 8. This ``toppling over'' is also sometimes called
   316   JavaScript and Java 8. This ``toppling over'' is also sometimes called
   304 \emph{catastrophic backtracking}.  I have also seen the term
   317   \emph{catastrophic backtracking}.  I have also seen the term
   305 \emph{eternal matching} used for this.  The problem with evil regular
   318   \emph{eternal matching} used for this.  The problem with evil regular
   306 expressions is that they can have some serious consequences, for
   319   expressions is that they can have some serious consequences, for
   307 example, if you use them in your web-application. The reason is that
   320   example, if you use them in your web-application. The reason is that
   308 hackers can look for these instances where the matching engine behaves
   321   hackers can look for these instances where the matching engine behaves
   309 badly and then mount a nice DoS-attack against your application. These
   322   badly and then mount a nice DoS-attack against your application. These
   310 attacks are already have their own name: \emph{Regular Expression
   323   attacks are already have their own name: \emph{Regular Expression
   311   Denial of Service Attacks (ReDoS)}.
   324   Denial of Service Attacks (ReDoS)}.
   312 
   325 
   313 It will be instructive to look behind the ``scenes'' to find
   326 It will be instructive to look behind the ``scenes'' to find
   314 out why Python and Ruby (and others) behave so badly when
   327 out why Python and Ruby (and others) behave so badly when
   315 matching strings with evil regular expressions. But we will also look
   328 matching strings with evil regular expressions. But we will also look