38 \begin{document} |
39 \begin{document} |
39 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017, 2018, 2019} |
40 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017, 2018, 2019} |
40 |
41 |
41 \section*{Handout 1} |
42 \section*{Handout 1} |
42 |
43 |
43 This module is about text processing, be it for web-crawlers, |
44 The first part of this module is about text processing, be it for |
44 compilers, dictionaries, DNA-data, ad filters and so on. When looking |
45 web-crawlers, compilers, dictionaries, DNA-data, ad filters and so on. |
45 for a particular string, like \pcode{"foobar"} in a large text we can |
46 When looking for a particular string, say \pcode{"foobar"}, in a large |
46 use the Knuth-Morris-Pratt algorithm, which is currently the most |
47 text we can use the Knuth-Morris-Pratt algorithm, which is currently the |
47 efficient general string search algorithm. But often we do \emph{not} |
48 most efficient general string search algorithm. But often we do |
48 just look for a particular string, but for string patterns. For |
49 \emph{not} just look for a particular string, but for string patterns. |
49 example, in program code we need to identify what are the keywords |
50 For example, in program code we need to identify what are the keywords |
50 (\texttt{if}, \texttt{then}, \texttt{while}, \texttt{for}, etc), what |
51 (\texttt{if}, \texttt{then}, \texttt{while}, \texttt{for}, etc) and what |
51 are the identifiers (variable names). A pattern for identifiers could |
52 are the identifiers (variable names). A pattern for identifiers could be |
52 be stated as: they start with a letter, followed by zero or more |
53 stated as: they start with a letter, followed by zero or more letters, |
53 letters, numbers and underscores. You might also be surprised what |
54 numbers and underscores. |
54 constraints programming languages impose about numbers: for example |
55 |
55 123 in JSON is OK, but 0123 is not. |
56 %You might also be surprised what |
56 |
57 %constraints programming languages impose about numbers: for example |
57 Often we also face the problem that we are given a string (for example |
58 %123 in JSON is OK, but 0123 is not. |
58 some user input) and want to know whether it matches a particular |
59 |
|
60 Often we also face the problem that we are given a string, for example |
|
61 some user input, and we want to know whether it matches a particular |
59 pattern---be it an email address, for example. In this way we can |
62 pattern---be it an email address, for example. In this way we can |
60 exclude user input that would otherwise have nasty effects on our |
63 exclude user input that would otherwise have nasty effects on our |
61 program (crashing it or making it go into an infinite loop, if not |
64 program (crashing it or making it go into an infinite loop, if not |
62 worse). In tools like Snort, scanning for computer viruses or |
65 worse). This kind of ``inspecting'' mechanism is implemented in popular |
|
66 network security tools such as Snort and |
|
67 Bro.\footnote{\url{www.snort.org}, \url{www.bro.org}} They scan incoming |
|
68 network traffic for computer viruses or malicious packets. Also |
63 filtering out spam usually involves scanning for some signature |
69 filtering out spam usually involves scanning for some signature |
64 (essentially a string pattern). The point is that the fast |
70 (essentially a string pattern). The point is that the fast |
65 Knuth-Morris-Pratt algorithm for strings is not good enough for such |
71 Knuth-Morris-Pratt algorithm for strings is not good enough for such |
66 string \emph{patterns}.\smallskip |
72 string \emph{patterns}.\smallskip |
67 |
73 |
68 \defn{Regular expressions} help with conveniently specifying |
74 \defn{Regular expressions} help with conveniently specifying |
69 such patterns. The idea behind regular expressions is that |
75 such patterns. The idea behind regular expressions is that |
79 |
85 |
80 \begin{equation}\label{email} |
86 \begin{equation}\label{email} |
81 \texttt{[a-z0-9\_.-]+} \;\;\texttt{@}\;\; \texttt{[a-z0-9.-]+} \;\;\texttt{.}\;\; \texttt{[a-z.]\{2,6\}} |
87 \texttt{[a-z0-9\_.-]+} \;\;\texttt{@}\;\; \texttt{[a-z0-9.-]+} \;\;\texttt{.}\;\; \texttt{[a-z.]\{2,6\}} |
82 \end{equation} |
88 \end{equation} |
83 |
89 |
84 \noindent where the first part, the user name, matches one or more lowercase |
90 \noindent where the first part, the user name, matches one or more |
85 letters (\pcode{a-z}), digits (\pcode{0-9}), underscores, dots |
91 lowercase letters (\pcode{a-z}), digits (\pcode{0-9}), underscores, dots |
86 and hyphens. The \pcode{+} at the end of the brackets ensures |
92 and hyphens. The \pcode{+} at the end of the brackets ensures the ``one |
87 the ``one or more''. Then comes the email \pcode{@}-sign, followed |
93 or more''. Then comes the email \pcode{@}-sign, followed by the domain |
88 by the domain name which must be one or more lowercase |
94 name which must be one or more lowercase letters, digits, underscores, |
89 letters, digits, underscores, dots or hyphens. Note there |
95 dots or hyphens (but no underscores). Finally there must be a dot |
90 cannot be an underscore in the domain name. Finally there must |
96 followed by the toplevel domain. This toplevel domain must be 2 to 6 |
91 be a dot followed by the toplevel domain. This toplevel domain |
97 lowercase letters including the dot. Example strings which follow this |
92 must be 2 to 6 lowercase letters including the dot. Example |
98 pattern are: |
93 strings which follow this pattern are: |
|
94 |
99 |
95 \begin{lstlisting}[language={},numbers=none,keywordstyle=\color{black}] |
100 \begin{lstlisting}[language={},numbers=none,keywordstyle=\color{black}] |
96 niceandsimple@example.org |
101 niceandsimple@example.org |
97 very.common@example.co.uk |
102 very.common@example.co.uk |
98 a.little.lengthy.but.fine@dept.example.ac.uk |
103 a.little.lengthy.but.fine@dept.example.ac.uk |
180 |
185 |
181 \[ |
186 \[ |
182 \pcode{""""https?://[^"]*"""".r} |
187 \pcode{""""https?://[^"]*"""".r} |
183 \] |
188 \] |
184 |
189 |
185 \noindent Note also that the convention in Scala is that |
190 \noindent The convention in Scala is that \texttt{.r} converts a string |
186 \texttt{.r} converts a string into a regular expression. I |
191 into a regular expression. I leave it to you to ponder whether this |
187 leave it to you to ponder whether this regular expression |
192 regular expression really captures all possible web-addresses. If you |
188 really captures all possible web-addresses. If you need a quick |
193 need a quick recap about regular expressions and how the match strings, |
189 recap about regular expressions and how the match strings, |
194 here is a quick video: \url{https://youtu.be/bgBWp9EIlMM}. |
190 here is a quick video |
|
191 \url{https://youtu.be/bgBWp9EIlMM}. |
|
192 |
195 |
193 \subsection*{Why Study Regular Expressions?} |
196 \subsection*{Why Study Regular Expressions?} |
194 |
197 |
195 Regular expressions were introduced by Kleene in the 1950ies and they |
198 Regular expressions were introduced by Kleene in the 1950ies and they |
196 have been object of intense study since then. They are nowadays pretty |
199 have been object of intense study since then. They are nowadays pretty |
197 much ubiquitous in computer science. There are many libraries |
200 much ubiquitous in computer science. There are many libraries |
198 implementing regular expressions. I am sure you have come across them |
201 implementing regular expressions. I am sure you have come across them |
199 before (remember the PRA module?). Why on earth then is there any |
202 before (remember the PRA module?). |
200 interest in studying them again in depth in this module? Well, one |
203 |
201 answer is in the following two graphs about regular expression |
204 Why on earth then is there any interest in studying them again in depth |
202 matching in Python, Ruby, JavaScript and Java (Version 8). |
205 in this module? Well, one answer is in the following two graphs about |
203 |
206 regular expression matching in Python, Ruby, JavaScript and Java |
204 \begin{center} |
207 (Version 8). |
205 \begin{tabular}{@{\hspace{-1mm}}c@{\hspace{1mm}}c@{}} |
208 |
|
209 \begin{center} |
|
210 \begin{tabular}{@{\hspace{-1mm}}c@{\hspace{1.5mm}}c@{}} |
206 \begin{tikzpicture} |
211 \begin{tikzpicture} |
207 \begin{axis}[ |
212 \begin{axis}[ |
208 title={Graph: $\texttt{(a*)*\,b}$ and strings |
213 title={Graph: $\texttt{(a*)*\,b}$ and strings |
209 $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$}, |
214 $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$}, |
210 xlabel={$n$}, |
215 xlabel={$n$}, |
254 \end{tabular} |
259 \end{tabular} |
255 \end{center} |
260 \end{center} |
256 |
261 |
257 \noindent This first graph shows that Python, JavaScript and Java 8 need |
262 \noindent This first graph shows that Python, JavaScript and Java 8 need |
258 approximately 30 seconds to find out that the regular expression |
263 approximately 30 seconds to find out that the regular expression |
259 $\texttt{(a*)*\,b}$ does not match strings of 28 \texttt{a}s. |
264 $\texttt{(a*)*\,b}$ does not match strings of 28 \texttt{a}s. Similarly, |
260 Similarly, the second shows that Python needs approximately 29 seconds |
265 the second shows that Python and Ruby need approximately 29 seconds for finding |
261 for finding out whether a string of 28 \texttt{a}s matches the regular |
266 out whether a string of 28 \texttt{a}s matches the regular expression |
262 expression \texttt{a?\{28\}\,a\{28\}}. Ruby is even slightly |
267 \texttt{a?\{28\}\,a\{28\}}.\footnote{In this |
263 worse.\footnote{In this example Ruby uses the slightly different |
268 example Ruby uses actually the slightly different regular expression |
264 regular expression \texttt{a?a?a?...a?a?aaa...aa}, where the |
269 \texttt{a?a?a?...a?a?aaa...aa}, where the \texttt{a?} and \texttt{a} |
265 \texttt{a?} and \texttt{a} each occur $n$ times. More such test |
270 each occur $n$ times. More such test cases can be found at |
266 cases can be found at \url{https://www.owasp.org/index.php/Regular_expression_Denial_of_Service_-_ReDoS}.} |
271 \url{https://www.owasp.org/index.php/Regular_expression_Denial_of_Service_-_ReDoS}.} |
267 Admittedly, these regular expressions are carefully chosen to exhibit |
272 Admittedly, these regular expressions are carefully chosen to exhibit |
268 this exponential behaviour, but similar ones occur more often than one |
273 this exponential behaviour, but similar ones occur more often than one |
269 wants in ``real life''. For example, on 20 July 2016 a similar regular |
274 wants in ``real life''. For example, on 20 July 2016 a similar regular |
270 expression brought the webpage \href{http://stackexchange.com}{Stack |
275 expression brought the webpage \href{http://stackexchange.com}{Stack |
271 Exchange} to its knees: |
276 Exchange} to its knees: |
272 |
277 |
273 \begin{center} |
278 \begin{center} |
274 \url{http://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016} |
279 \url{http://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016} |
275 \end{center} |
280 \end{center} |
276 |
281 |
287 \begin{center} |
292 \begin{center} |
288 \url{http://davidvgalbraith.com/how-i-fixed-atom/} |
293 \url{http://davidvgalbraith.com/how-i-fixed-atom/} |
289 \end{center} |
294 \end{center} |
290 |
295 |
291 \noindent |
296 \noindent |
292 and when somebody tried to match web-addresses using regular |
297 and also when somebody tried to match web-addresses using a |
293 expressions |
298 relatively simple regular expression |
294 |
299 |
295 \begin{center} |
300 \begin{center} |
296 \url{https://www.tutorialdocs.com/article/regex-trap.html} |
301 \url{https://www.tutorialdocs.com/article/regex-trap.html} |
297 \end{center} |
302 \end{center} |
298 |
303 |
|
304 \noindent |
|
305 Finally, on 2 July 2019 Cloudflare had a global outage because of a |
|
306 regular expression (they had no outage for the last 6 years). What |
|
307 happened is nicely explained in the blog: |
|
308 |
|
309 \begin{center} |
|
310 \url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019} |
|
311 \end{center} |
299 |
312 |
300 Such troublesome regular expressions are sometimes called \emph{evil |
313 Such troublesome regular expressions are sometimes called \emph{evil |
301 regular expressions} because they have the potential to make regular |
314 regular expressions} because they have the potential to make regular |
302 expression matching engines to topple over, like in Python, Ruby and |
315 expression matching engines to topple over, like in Python, Ruby, |
303 Java 8. This ``toppling over'' is also sometimes called |
316 JavaScript and Java 8. This ``toppling over'' is also sometimes called |
304 \emph{catastrophic backtracking}. I have also seen the term |
317 \emph{catastrophic backtracking}. I have also seen the term |
305 \emph{eternal matching} used for this. The problem with evil regular |
318 \emph{eternal matching} used for this. The problem with evil regular |
306 expressions is that they can have some serious consequences, for |
319 expressions is that they can have some serious consequences, for |
307 example, if you use them in your web-application. The reason is that |
320 example, if you use them in your web-application. The reason is that |
308 hackers can look for these instances where the matching engine behaves |
321 hackers can look for these instances where the matching engine behaves |
309 badly and then mount a nice DoS-attack against your application. These |
322 badly and then mount a nice DoS-attack against your application. These |
310 attacks are already have their own name: \emph{Regular Expression |
323 attacks are already have their own name: \emph{Regular Expression |
311 Denial of Service Attacks (ReDoS)}. |
324 Denial of Service Attacks (ReDoS)}. |
312 |
325 |
313 It will be instructive to look behind the ``scenes'' to find |
326 It will be instructive to look behind the ``scenes'' to find |
314 out why Python and Ruby (and others) behave so badly when |
327 out why Python and Ruby (and others) behave so badly when |
315 matching strings with evil regular expressions. But we will also look |
328 matching strings with evil regular expressions. But we will also look |