621
+ − 1
% !TEX program = xelatex
105
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
+ − 2
\documentclass{article}
237
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 3
\usepackage{../style}
217
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 4
\usepackage{../langs}
871
+ − 5
\usepackage{../graphicss}
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 6
\usepackage{../data}
477
+ − 7
\usepackage{lstlinebgrd}
+ − 8
\definecolor{capri}{rgb}{0.0, 0.75, 1.0}
108
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 9
306
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 10
%%http://regexcrossword.com/challenges/cities/puzzles/1
398
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 11
%%https://jex.im/regulex/
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 12
%%https://www.reddit.com/r/ProgrammingLanguages/comments/42dlem/mona_compiler_development_part_2_parsing/
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 13
%%https://www.reddit.com/r/ProgrammingLanguages/comments/43wlkq/formal_grammar_for_csh_tsch_sh_or_bash/
112
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 14
403
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 15
%% regex displayers
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 16
%% https://regexper.com/#a%7Ca
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 17
%% https://www.debuggex.com
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 18
%% https://jex.im/regulex/#!embed=false&flags=&re=%5E(a%7Cb)*%3F%24
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 19
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 20
%% email validator
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 21
%% http://www.ex-parrot.com/%7Epdw/Mail-RFC822-Address.html
496
+ − 22
% https://jackfoxy.github.io/FsRegEx/emailregex.html
+ − 23
403
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 24
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 25
%% regex testers
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 26
% https://regex101.com
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 27
% http://regexr.com
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 28
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 29
%% emacs regexes
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 30
%% https://www.gnu.org/software/emacs/manual/html_node/elisp/Regular-Expressions.html
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 31
872
+ − 32
%% reasons for a new programming language
403
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 33
%% http://beautifulracket.com
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 34
418
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 35
% compiler explorer
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 36
% https://gcc.godbolt.org
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 37
622
+ − 38
716
+ − 39
% good article how languages have been influenced
+ − 40
% 10 MOST(LY DEAD) INFLUENTIAL PROGRAMMING LANGUAGES
+ − 41
% https://www.hillelwayne.com/post/influential-dead-languages/
+ − 42
622
+ − 43
105
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
+ − 44
\begin{document}
924
+ − 45
\fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2023}
105
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
+ − 46
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
+ − 47
\section*{Handout 1}
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
+ − 48
722
+ − 49
The purpose of a compiler is to transform a program a human can read and
830
+ − 50
write into code machines can run as fast as possible. Developing a
722
+ − 51
compiler is an old craft going back to 1952 with the first compiler
+ − 52
written by Grace Hopper.\footnote{Who many years ago was invited on a
743
+ − 53
talk show hosted by David Letterman.
745
+ − 54
\here{https://youtu.be/3N_ywhx6_K0?t=31}} Why studying compilers
924
+ − 55
nowadays? An interesting answer is given by John Regehr in his compiler
743
+ − 56
blog:\here{http://blog.regehr.org/archives/1419}
690
+ − 57
+ − 58
\begin{quote}\it{}``We can start off with a couple of observations
+ − 59
about the role of compilers. First, hardware is getting weirder
+ − 60
rather than getting clocked faster: almost all processors are
+ − 61
multicores and it looks like there is increasing asymmetry in
+ − 62
resources across cores. Processors come with vector units, crypto
+ − 63
accelerators, bit twiddling instructions, and lots of features to
+ − 64
make virtualization and concurrency work. We have DSPs, GPUs,
+ − 65
big.little, and Xeon Phi. This is only scratching the
+ − 66
surface. Second, we’re getting tired of low-level languages and
+ − 67
their associated security disasters, we want to write new code, to
+ − 68
whatever extent possible, in safer, higher-level
+ − 69
languages. Compilers are caught right in the middle of these
+ − 70
opposing trends: one of their main jobs is to help bridge the large
+ − 71
and growing gap between increasingly high-level languages and
+ − 72
increasingly wacky platforms. It’s effectively a perpetual
+ − 73
employment act for solid compiler hackers.''
+ − 74
\end{quote}
+ − 75
722
+ − 76
\noindent
872
+ − 77
Given this, the goal of this module is to become a solid (beginner) compiler
+ − 78
hacker and as part of the coursework to implement two small
+ − 79
compilers for two very small programming languages.
690
+ − 80
722
+ − 81
The first part of the module is about the problem of text processing,
+ − 82
which is needed for compilers, but also for dictionaries, DNA-data,
+ − 83
spam-filters and so on. When looking for a particular string, say
+ − 84
\pcode{"foobar"}, in a large text we can use the Knuth-Morris-Pratt
+ − 85
algorithm, which is currently the most efficient general string search
+ − 86
algorithm. But often we do \emph{not} just look for a particular string,
+ − 87
but for string patterns. For example, in program code we need to
+ − 88
identify what are the keywords (\texttt{if}, \texttt{then},
+ − 89
\texttt{while}, \texttt{for}, etc) and what are the identifiers
+ − 90
(variable names). A pattern for identifiers could be stated as: they
+ − 91
start with a letter, followed by zero or more letters, numbers and
+ − 92
underscores.
618
+ − 93
621
+ − 94
%You might also be surprised what
+ − 95
%constraints programming languages impose about numbers: for example
+ − 96
%123 in JSON is OK, but 0123 is not.
706
+ − 97
%
+ − 98
% The regex for JASON numbers is
+ − 99
%
+ − 100
% -?(0|[1-9][0-9]*)(\.[0-9]+)?([eE][+-]?[0-9]+)?
621
+ − 101
+ − 102
Often we also face the problem that we are given a string, for example
+ − 103
some user input, and we want to know whether it matches a particular
622
+ − 104
pattern---is it an email address, for example. In this way we can
618
+ − 105
exclude user input that would otherwise have nasty effects on our
+ − 106
program (crashing it or making it go into an infinite loop, if not
872
+ − 107
worse). This kind of ``vetting'' mechanism is also implemented in
622
+ − 108
popular network security tools such as Snort and
872
+ − 109
Zeek.\here{www.snort.org}\here{www.bro.org} They scan incoming
622
+ − 110
network traffic for computer viruses or malicious packets. Similarly
+ − 111
filtering out spam usually involves looking for some signature
621
+ − 112
(essentially a string pattern). The point is that the fast
618
+ − 113
Knuth-Morris-Pratt algorithm for strings is not good enough for such
+ − 114
string \emph{patterns}.\smallskip
291
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 115
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 116
\defn{Regular expressions} help with conveniently specifying
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 117
such patterns. The idea behind regular expressions is that
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 118
they are a simple method for describing languages (or sets of
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 119
strings)\ldots at least languages we are interested in in
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 120
computer science. For example there is no convenient regular
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 121
expression for describing the English language short of
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 122
enumerating all English words. But they seem useful for
399
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 123
describing for example simple email addresses.\footnote{See
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 124
``8 Regular Expressions You Should Know''
291
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 125
\url{http://goo.gl/5LoVX7}} Consider the following regular
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 126
expression
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 127
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 128
\begin{equation}\label{email}
416
+ − 129
\texttt{[a-z0-9\_.-]+} \;\;\texttt{@}\;\; \texttt{[a-z0-9.-]+} \;\;\texttt{.}\;\; \texttt{[a-z.]\{2,6\}}
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 130
\end{equation}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 131
621
+ − 132
\noindent where the first part, the user name, matches one or more
+ − 133
lowercase letters (\pcode{a-z}), digits (\pcode{0-9}), underscores, dots
+ − 134
and hyphens. The \pcode{+} at the end of the brackets ensures the ``one
+ − 135
or more''. Then comes the email \pcode{@}-sign, followed by the domain
+ − 136
name which must be one or more lowercase letters, digits, underscores,
+ − 137
dots or hyphens (but no underscores). Finally there must be a dot
+ − 138
followed by the toplevel domain. This toplevel domain must be 2 to 6
+ − 139
lowercase letters including the dot. Example strings which follow this
+ − 140
pattern are:
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 141
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 142
\begin{lstlisting}[language={},numbers=none,keywordstyle=\color{black}]
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 143
niceandsimple@example.org
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 144
very.common@example.co.uk
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 145
a.little.lengthy.but.fine@dept.example.ac.uk
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 146
other.email-with-dash@example.edu
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 147
\end{lstlisting}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 148
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 149
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 150
\noindent
399
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 151
But for example the following two do not
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 152
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 153
\begin{lstlisting}[language={},numbers=none,keywordstyle=\color{black}]
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 154
user@localserver
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 155
disposable.style.email.with+symbol@example.com
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 156
\end{lstlisting}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 157
471
+ − 158
\noindent according to the regular expression we specified in line
+ − 159
\eqref{email} above. Whether this is intended or not is a different
+ − 160
question (the second email above is actually an acceptable email
550
+ − 161
address according to the RFC 5322 standard for email addresses).
399
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 162
327
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 163
As mentioned above, identifiers, or variables, in program code
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 164
are often required to satisfy the constraints that they start
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 165
with a letter and then can be followed by zero or more letters
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 166
or numbers and also can include underscores, but not as the
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 167
first character. Such identifiers can be recognised with the
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 168
regular expression
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 169
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 170
\begin{center}
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 171
\pcode{[a-zA-Z] [a-zA-Z0-9_]*}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 172
\end{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 173
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 174
\noindent Possible identifiers that match this regular expression
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 175
are \pcode{x}, \pcode{foo}, \pcode{foo_bar_1}, \pcode{A_very_42_long_object_name},
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 176
but not \pcode{_i} and also not \pcode{4you}.
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 177
404
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 178
Many programming languages offer libraries that can be used to
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 179
validate such strings against regular expressions. Also there
248
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 180
are some common, and I am sure very familiar, ways of how to
404
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 181
construct regular expressions. For example in Scala we have
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 182
a library implementing the following regular expressions:
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 183
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 184
\begin{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 185
\begin{tabular}{lp{9cm}}
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 186
\pcode{re*} & matches 0 or more occurrences of preceding
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 187
expression\\
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 188
\pcode{re+} & matches 1 or more occurrences of preceding
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 189
expression\\
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 190
\pcode{re?} & matches 0 or 1 occurrence of preceding
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 191
expression\\
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 192
\pcode{re\{n\}} & matches exactly \pcode{n} number of
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 193
occurrences of preceding expression\\
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 194
\pcode{re\{n,m\}} & matches at least \pcode{n} and at most {\tt m}
550
+ − 195
occurrences of the preceding expression\\
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 196
\pcode{[...]} & matches any single character inside the
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 197
brackets\\
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 198
\pcode{[^...]} & matches any single character not inside the
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 199
brackets\\
250
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 200
\pcode{...-...} & character ranges\\
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 201
\pcode{\\d} & matches digits; equivalent to \pcode{[0-9]}\\
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 202
\pcode{.} & matches every character except newline\\
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 203
\pcode{(re)} & groups regular expressions and remembers
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 204
matched text
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 205
\end{tabular}
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 206
\end{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 207
925
+ − 208
\noindent
924
+ − 209
The syntax is pretty universal and can be found in many regular
+ − 210
expression libraries. If you need a quick recap about regular
+ − 211
expressions and how the match strings, here is a quick video:
+ − 212
\url{https://youtu.be/bgBWp9EIlMM}.
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 213
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 214
\subsection*{Why Study Regular Expressions?}
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 215
471
+ − 216
Regular expressions were introduced by Kleene in the 1950ies and they
+ − 217
have been object of intense study since then. They are nowadays pretty
+ − 218
much ubiquitous in computer science. There are many libraries
+ − 219
implementing regular expressions. I am sure you have come across them
622
+ − 220
before (remember the PRA or PEP modules?).
621
+ − 221
+ − 222
Why on earth then is there any interest in studying them again in depth
+ − 223
in this module? Well, one answer is in the following two graphs about
924
+ − 224
regular expression matching in Python, Ruby, JavaScript, Swift and Java
621
+ − 225
(Version 8).
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 226
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 227
\begin{center}
621
+ − 228
\begin{tabular}{@{\hspace{-1mm}}c@{\hspace{1.5mm}}c@{}}
477
+ − 229
\begin{tikzpicture}
+ − 230
\begin{axis}[
+ − 231
title={Graph: $\texttt{(a*)*\,b}$ and strings
+ − 232
$\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
+ − 233
xlabel={$n$},
+ − 234
x label style={at={(1.05,0.0)}},
+ − 235
ylabel={time in secs},
+ − 236
enlargelimits=false,
+ − 237
xtick={0,5,...,30},
+ − 238
xmax=33,
+ − 239
ymax=35,
+ − 240
ytick={0,5,...,30},
+ − 241
scaled ticks=false,
+ − 242
axis lines=left,
+ − 243
width=5.5cm,
+ − 244
height=4.5cm,
924
+ − 245
legend entries={Python, Java 8, JavaScript, Swift},
477
+ − 246
legend pos=north west,
+ − 247
legend cell align=left]
+ − 248
\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
+ − 249
\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
618
+ − 250
\addplot[red,mark=*, mark options={fill=white}] table {re-js.data};
924
+ − 251
\addplot[magenta,mark=*, mark options={fill=white}] table {re-swift.data};
477
+ − 252
\end{axis}
+ − 253
\end{tikzpicture}
+ − 254
&
291
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 255
\begin{tikzpicture}
399
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 256
\begin{axis}[
415
+ − 257
title={Graph: $\texttt{a?\{n\}\,a{\{n\}}}$ and strings
+ − 258
$\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
+ − 259
xlabel={$n$},
+ − 260
x label style={at={(1.05,0.0)}},
399
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 261
ylabel={time in secs},
291
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 262
enlargelimits=false,
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 263
xtick={0,5,...,30},
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 264
xmax=33,
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 265
ymax=35,
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 266
ytick={0,5,...,30},
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 267
scaled ticks=false,
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 268
axis lines=left,
415
+ − 269
width=5.5cm,
318
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 270
height=4.5cm,
291
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 271
legend entries={Python,Ruby},
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 272
legend pos=north west,
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 273
legend cell align=left]
448
+ − 274
\addplot[blue,mark=*, mark options={fill=white}] table {re-python.data};
+ − 275
\addplot[brown,mark=triangle*, mark options={fill=white}] table {re-ruby.data};
291
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 276
\end{axis}
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 277
\end{tikzpicture}
415
+ − 278
\end{tabular}
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 279
\end{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 280
924
+ − 281
\noindent The first graph shows that Python, JavaScript, Swift and Java 8 need
477
+ − 282
approximately 30 seconds to find out that the regular expression
621
+ − 283
$\texttt{(a*)*\,b}$ does not match strings of 28 \texttt{a}s. Similarly,
+ − 284
the second shows that Python and Ruby need approximately 29 seconds for finding
+ − 285
out whether a string of 28 \texttt{a}s matches the regular expression
+ − 286
\texttt{a?\{28\}\,a\{28\}}.\footnote{In this
+ − 287
example Ruby uses actually the slightly different regular expression
+ − 288
\texttt{a?a?a?...a?a?aaa...aa}, where the \texttt{a?} and \texttt{a}
+ − 289
each occur $n$ times. More such test cases can be found at
+ − 290
\url{https://www.owasp.org/index.php/Regular_expression_Denial_of_Service_-_ReDoS}.}
477
+ − 291
Admittedly, these regular expressions are carefully chosen to exhibit
+ − 292
this exponential behaviour, but similar ones occur more often than one
+ − 293
wants in ``real life''. For example, on 20 July 2016 a similar regular
+ − 294
expression brought the webpage \href{http://stackexchange.com}{Stack
621
+ − 295
Exchange} to its knees:
407
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 296
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 297
\begin{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 298
\url{http://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 299
\end{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 300
417
+ − 301
\noindent I can also highly recommend a cool webtalk from an engineer
+ − 302
from Stack Exchange on the same subject:
+ − 303
+ − 304
\begin{center}
+ − 305
\url{https://vimeo.com/112065252}
+ − 306
\end{center}
+ − 307
407
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 308
\noindent
550
+ − 309
A similar problem also occurred in the Atom editor:
416
+ − 310
+ − 311
\begin{center}
+ − 312
\url{http://davidvgalbraith.com/how-i-fixed-atom/}
+ − 313
\end{center}
+ − 314
563
+ − 315
\noindent
621
+ − 316
and also when somebody tried to match web-addresses using a
+ − 317
relatively simple regular expression
554
+ − 318
+ − 319
\begin{center}
924
+ − 320
\url{https://archive.ph/W5Ogx#selection-141.1-141.36}
554
+ − 321
\end{center}
+ − 322
621
+ − 323
\noindent
+ − 324
Finally, on 2 July 2019 Cloudflare had a global outage because of a
830
+ − 325
regular expression (they had no outage for the 6 years before). What
621
+ − 326
happened is nicely explained in the blog:
+ − 327
+ − 328
\begin{center}
+ − 329
\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019}
+ − 330
\end{center}
563
+ − 331
415
+ − 332
Such troublesome regular expressions are sometimes called \emph{evil
722
+ − 333
regular expressions} because they have the potential to make regular
+ − 334
expression matching engines to topple over, like in Python, Ruby,
+ − 335
JavaScript and Java 8. This ``toppling over'' is also sometimes called
+ − 336
\emph{catastrophic backtracking}. I have also seen the term
+ − 337
\emph{eternal matching} used for this. The problem with evil regular
+ − 338
expressions and catastrophic backtracking is that they can have some
+ − 339
serious consequences, for example, if you use them in your
+ − 340
web-application. The reason is that hackers can look for these instances
+ − 341
where the matching engine behaves badly and then mount a nice DoS-attack
+ − 342
against your application. These attacks are already have their own name:
+ − 343
\emph{Regular Expression Denial of Service Attacks (ReDoS)}.
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 344
622
+ − 345
It will be instructive to look behind the ``scenes'' to find out why
+ − 346
Python and Ruby (and others) behave so badly when matching strings with
+ − 347
evil regular expressions. But we will also look at a relatively simple
+ − 348
algorithm that solves this problem much better than Python and Ruby
+ − 349
do\ldots actually it will be two versions of the algorithm: the first
+ − 350
one will be able in the example \texttt{a?\{n\}\,a\{n\}} to process strings of
+ − 351
approximately 1,100 \texttt{a}s in 23 seconds, while the second version
+ − 352
will even be able to process up to 11,000(!) in 5 seconds, see the graph
+ − 353
below:
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 354
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 355
\begin{center}
291
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 356
\begin{tikzpicture}
399
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 357
\begin{axis}[
415
+ − 358
title={Graph: $\texttt{a?\{n\}\,a{\{n\}}}$ and strings
+ − 359
$\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
+ − 360
xlabel={$n$},
+ − 361
x label style={at={(1.05,0.0)}},
399
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 362
ylabel={time in secs},
291
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 363
enlargelimits=false,
477
+ − 364
xtick={0,3000,...,12000},
+ − 365
xmax=13000,
443
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 366
ymax=32,
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 367
ytick={0,5,...,30},
291
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 368
scaled ticks=false,
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 369
axis lines=left,
415
+ − 370
width=7cm,
622
+ − 371
height=4.4cm,
415
+ − 372
legend entries={Our Algorithm V1, Our Algorithm V2},
+ − 373
legend pos=outer north east]
+ − 374
\addplot[green,mark=square*,mark options={fill=white}] table {re2.data};
291
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 375
\addplot[black,mark=square*,mark options={fill=white}] table {re3.data};
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 376
\end{axis}
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 377
\end{tikzpicture}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 378
\end{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 379
415
+ − 380
\noindent And in the case of the regular expression $\texttt{(a*)*\,b}$
563
+ − 381
and strings of \texttt{a}s we will beat Java 8 by factor of
415
+ − 382
approximately 1,000,000 (note the scale on the $x$-axis).
+ − 383
+ − 384
\begin{center}
+ − 385
\begin{tikzpicture}
+ − 386
\begin{axis}[
+ − 387
title={Graph: $\texttt{(a*)*\,b}$ and strings
+ − 388
$\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
+ − 389
xlabel={$n$},
+ − 390
x label style={at={(1.05,0.0)}},
+ − 391
ylabel={time in secs},
+ − 392
enlargelimits=false,
443
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 393
ymax=32,
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 394
ytick={0,5,...,30},
415
+ − 395
axis lines=left,
+ − 396
width=7cm,
+ − 397
height=4.5cm,
+ − 398
legend entries={Our Algorithm V2},
+ − 399
legend pos=outer north east]
+ − 400
\addplot[black,mark=square*,mark options={fill=white}] table {re3a.data};
+ − 401
\end{axis}
+ − 402
\end{tikzpicture}
+ − 403
\end{center}
+ − 404
722
+ − 405
\noindent
+ − 406
You might have wondered above why I looked at the (now) old Java 8: the
+ − 407
reason is that Java 9 and later versions are a bit better, but we will
+ − 408
still beat them hands down with our regex matcher.
+ − 409
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 410
\subsection*{Basic Regular Expressions}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 411
399
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 412
The regular expressions shown earlier for Scala, we
925
+ − 413
will in this module call \emph{extended regular expressions}. The ones we
+ − 414
will mainly study are \emph{basic regular
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 415
expressions}, which by convention we will just call
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 416
\emph{regular expressions}, if it is clear what we mean. The
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 417
attraction of (basic) regular expressions is that many
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 418
features of the extended ones are just syntactic sugar.
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 419
(Basic) regular expressions are defined by the following
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 420
grammar:
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 421
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 422
\begin{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 423
\begin{tabular}{r@{\hspace{1mm}}r@{\hspace{1mm}}l@{\hspace{13mm}}l}
399
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 424
$r$ & $::=$ & $\ZERO$ & null language\\
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 425
& $\mid$ & $\ONE$ & empty string / \texttt{""} / []\\
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 426
& $\mid$ & $c$ & single character\\
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 427
& $\mid$ & $r_1 + r_2$ & alternative / choice\\
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 428
& $\mid$ & $r_1 \cdot r_2$ & sequence\\
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 429
& $\mid$ & $r^*$ & star (zero or more)\\
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 430
\end{tabular}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 431
\end{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 432
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 433
\noindent Because we overload our notation, there are some
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 434
subtleties you should be aware of. When regular expressions
404
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 435
are referred to, then $\ZERO$ (in bold font) does not stand for
399
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 436
the number zero: rather it is a particular pattern that does
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 437
not match any string. Similarly, in the context of regular
925
+ − 438
expressions, $\ONE$ does not stand for the number one, but for
399
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 439
a regular expression that matches the empty string. The letter
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 440
$c$ stands for any character from the alphabet at hand. Again
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 441
in the context of regular expressions, it is a particular
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 442
pattern that can match the specified character. You should
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 443
also be careful with our overloading of the star: assuming you
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 444
have read the handout about our basic mathematical notation,
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 445
you will see that in the context of languages (sets of
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 446
strings) the star stands for an operation on languages. Here
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 447
$r^*$ stands for a regular expression, which is different from
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 448
the operation on sets is defined as
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 449
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 450
\[
399
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 451
A\star\dn \bigcup_{0\le n} A^n
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 452
\]
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 453
334
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 454
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 455
We will use parentheses to disambiguate regular expressions.
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 456
Parentheses are not really part of a regular expression, and
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 457
indeed we do not need them in our code because there the tree
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 458
structure of regular expressions is always clear. But for
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 459
writing them down in a more mathematical fashion, parentheses
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 460
will be helpful. For example we will write $(r_1 + r_2)^*$,
742
+ − 461
which is different from, say $r_1 + (r_2)^*$. This can be
+ − 462
seen if we write regular expressions as trees:
+ − 463
+ − 464
\begin{center}
+ − 465
\includegraphics[scale=0.65]{../pics/r1.pdf}
+ − 466
\hspace{3cm}
+ − 467
\includegraphics[scale=0.65]{../pics/r2.pdf}
+ − 468
\end{center}
+ − 469
+ − 470
\noindent
+ − 471
The regular expression on the left means
+ − 472
roughly zero or more times $r_1$ or $r_2$, while the one on the right
550
+ − 473
means $r_1$, or zero or more times $r_2$. This will turn out to
248
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 474
be two different patterns, which match in general different
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 475
strings. We should also write $(r_1 + r_2) + r_3$, which is
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 476
different from the regular expression $r_1 + (r_2 + r_3)$, but
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 477
in case of $+$ and $\cdot$ we actually do not care about the
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 478
order and just write $r_1 + r_2 + r_3$, or $r_1 \cdot r_2
550
+ − 479
\cdot r_3$, respectively. The reasons for this carelessness will become
399
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 480
clear shortly.
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 481
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 482
In the literature you will often find that the choice $r_1 +
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 483
r_2$ is written as $r_1\mid{}r_2$ or $r_1\mid\mid{}r_2$. Also,
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 484
often our $\ZERO$ and $\ONE$ are written $\varnothing$ and
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 485
$\epsilon$, respectively. Following the convention in the
550
+ − 486
literature, we will often omit the $\cdot$. This
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 487
is to make some concrete regular expressions more readable.
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 488
For example the regular expression for email addresses shown
550
+ − 489
in \eqref{email} would fully expanded look like
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 490
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 491
\[
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 492
\texttt{[...]+} \;\cdot\; \texttt{@} \;\cdot\;
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 493
\texttt{[...]+} \;\cdot\; \texttt{.} \;\cdot\;
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 494
\texttt{[...]\{2,6\}}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 495
\]
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 496
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 497
\noindent
471
+ − 498
which is much less readable than the regular expression in
+ − 499
\eqref{email}. Similarly for the regular expression that matches the
+ − 500
string $hello$ we should write
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 501
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 502
\[
248
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 503
h \cdot e \cdot l \cdot l \cdot o
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 504
\]
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 505
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 506
\noindent
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 507
but often just write {\it hello}.
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 508
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 509
If you prefer to think in terms of the implementation
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 510
of regular expressions in Scala, the constructors and
245
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 511
classes relate as follows\footnote{More about Scala is
830
+ − 512
in the handout about \emph{A Crash-Course on Scala} from PEP.}
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 513
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 514
\begin{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 515
\begin{tabular}{rcl}
399
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 516
$\ZERO$ & $\mapsto$ & \texttt{ZERO}\\
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 517
$\ONE$ & $\mapsto$ & \texttt{ONE}\\
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 518
$c$ & $\mapsto$ & \texttt{CHAR(c)}\\
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 519
$r_1 + r_2$ & $\mapsto$ & \texttt{ALT(r1, r2)}\\
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 520
$r_1 \cdot r_2$ & $\mapsto$ & \texttt{SEQ(r1, r2)}\\
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 521
$r^*$ & $\mapsto$ & \texttt{STAR(r)}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 522
\end{tabular}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 523
\end{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 524
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 525
A source of confusion might arise from the fact that we
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 526
use the term \emph{basic regular expression} for the regular
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 527
expressions used in ``theory'' and defined above, and
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 528
\emph{extended regular expression} for the ones used in
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 529
``practice'', for example in Scala. If runtime is not an
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 530
issue, then the latter can be seen as syntactic sugar of
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 531
the former. For example we could replace
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 532
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 533
\begin{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 534
\begin{tabular}{rcl}
471
+ − 535
$r^+$ & $\mapsto$ & $r\cdot r^*$\\
+ − 536
$r^?$ & $\mapsto$ & $\ONE + r$\\
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 537
$\backslash d$ & $\mapsto$ & $0 + 1 + 2 + \ldots + 9$\\
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 538
$[\text{\it a - z}]$ & $\mapsto$ & $a + b + \ldots + z$\\
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 539
\end{tabular}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 540
\end{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 541
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 542
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 543
\subsection*{The Meaning of Regular Expressions}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 544
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 545
So far we have only considered informally what the
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 546
\emph{meaning} of a regular expression is. This is not good
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 547
enough for specifications of what algorithms are supposed to
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 548
do or which problems they are supposed to solve.
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 549
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 550
To define the meaning of a regular expression we will
872
+ − 551
associate with every regular expression a language---a set of
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 552
strings. This language contains all the strings the regular
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 553
expression is supposed to match. To understand what is going
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 554
on here it is crucial that you have read the handout
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 555
about basic mathematical notations.
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 556
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 557
The \defn{meaning of a regular expression} can be defined
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 558
by a recursive function called $L$ (for language), which
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 559
is defined as follows
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 560
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 561
\begin{center}
399
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 562
\begin{tabular}{rcll}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 563
$L(\ZERO)$ & $\dn$ & $\{\}$\\
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 564
$L(\ONE)$ & $\dn$ & $\{[]\}$\\
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 565
$L(c)$ & $\dn$ & $\{"c"\}$ & or equivalently $\dn \{[c]\}$\\
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 566
$L(r_1+ r_2)$ & $\dn$ & $L(r_1) \cup L(r_2)$\\
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 567
$L(r_1 \cdot r_2)$ & $\dn$ & $L(r_1) \,@\, L(r_2)$\\
399
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 568
$L(r^*)$ & $\dn$ & $(L(r))\star$\\
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 569
\end{tabular}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 570
\end{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 571
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 572
\noindent As a result we can now precisely state what the
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 573
meaning, for example, of the regular expression $h \cdot
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 574
e \cdot l \cdot l \cdot o$ is, namely
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 575
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 576
\[
248
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 577
L(h \cdot e \cdot l \cdot l \cdot o) = \{"hello"\}
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 578
\]
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 579
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 580
\noindent This is expected because this regular expression
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 581
is only supposed to match the ``$hello$''-string. Similarly if
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 582
we have the choice-regular-expression $a + b$, its meaning is
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 583
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 584
\[
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 585
L(a + b) = \{"a", "b"\}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 586
\]
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 587
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 588
\noindent You can now also see why we do not make a difference
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 589
between the different regular expressions $(r_1 + r_2) + r_3$
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 590
and $r_1 + (r_2 + r_3)$\ldots they are not the same regular
248
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 591
expression, but they have the same meaning. For example
318
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 592
you can do the following calculation which shows they
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 593
have the same meaning:
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 594
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 595
\begin{eqnarray*}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 596
L((r_1 + r_2) + r_3) & = & L(r_1 + r_2) \cup L(r_3)\\
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 597
& = & L(r_1) \cup L(r_2) \cup L(r_3)\\
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 598
& = & L(r_1) \cup L(r_2 + r_3)\\
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 599
& = & L(r_1 + (r_2 + r_3))
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 600
\end{eqnarray*}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 601
830
+ − 602
\noindent
+ − 603
That means both languages are the same.
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 604
The point of the definition of $L$ is that we can use it to
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 605
precisely specify when a string $s$ is matched by a regular
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 606
expression $r$, namely if and only if $s \in L(r)$. In fact we
872
+ − 607
will write a program \pcode{match} that takes a string $s$
+ − 608
and a regular expression $r$ as arguments and returns
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 609
\emph{yes}, if $s \in L(r)$ and \emph{no}, if $s \not\in
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 610
L(r)$. We leave this for the next lecture.
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 611
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 612
There is one more feature of regular expressions that is worth
872
+ − 613
mentioning here. Given some strings, there are in general many
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 614
different regular expressions that can recognise these
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 615
strings. This is obvious with the regular expression $a + b$
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 616
which can match the strings $a$ and $b$. But also the regular
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 617
expression $b + a$ would match the same strings. However,
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 618
sometimes it is not so obvious whether two regular expressions
399
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 619
match the same strings: for example do $r^*$ and $\ONE + r
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 620
\cdot r^*$ match the same strings? What about $\ZERO^*$
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 621
and $\ONE^*$? This suggests the following relation between
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 622
\defn{equivalent regular expressions}:
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 623
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 624
\[
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 625
r_1 \equiv r_2 \;\dn\; L(r_1) = L(r_2)
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 626
\]
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 627
248
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 628
\noindent That means two regular expressions are said to be
550
+ − 629
equivalent if they match the same set of strings. That is
872
+ − 630
their meanings are the same. Therefore we
248
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 631
do not really distinguish between the different regular
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 632
expression $(r_1 + r_2) + r_3$ and $r_1 + (r_2 + r_3)$,
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 633
because they are equivalent. I leave you to the question
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 634
whether
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 635
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 636
\[
399
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 637
\ZERO^* \equiv \ONE^*
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 638
\]
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 639
399
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 640
\noindent holds or not? Such equivalences will be important
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 641
for our matching algorithm, because we can use them to
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 642
simplify regular expressions, which will mean we can speed
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 643
up the calculations.
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 644
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 645
\subsection*{My Fascination for Regular Expressions}
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 646
471
+ − 647
Up until a few years ago I was not really interested in regular
+ − 648
expressions. They have been studied for the last 60 years (by smarter
+ − 649
people than me)---surely nothing new can be found out about them. I
+ − 650
even have the vague recollection that I did not quite understand them
+ − 651
during my undergraduate study. If I remember correctly,\footnote{That
+ − 652
was really a long time ago.} I got utterly confused about $\ONE$
+ − 653
(which my lecturer wrote as $\epsilon$) and the empty string (which he
+ − 654
also wrote as $\epsilon$).\footnote{Obviously the lecturer must have
550
+ − 655
been bad ;o)} Since then, I have used regular expressions for
471
+ − 656
implementing lexers and parsers as I have always been interested in
+ − 657
all kinds of programming languages and compilers, which invariably
+ − 658
need regular expressions in some form or shape.
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 659
399
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 660
To understand my fascination \emph{nowadays} with regular
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 661
expressions, you need to know that my main scientific interest
471
+ − 662
for the last 17 years has been with theorem provers. I am a
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 663
core developer of one of
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 664
them.\footnote{\url{http://isabelle.in.tum.de}} Theorem
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 665
provers are systems in which you can formally reason about
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 666
mathematical concepts, but also about programs. In this way
550
+ − 667
theorem provers can help with the menacing problem of writing bug-free code. Theorem provers have
416
+ − 668
proved already their value in a number of cases (even in
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 669
terms of hard cash), but they are still clunky and difficult
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 670
to use by average programmers.
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 671
471
+ − 672
Anyway, in about 2011 I came across the notion of \defn{derivatives of
+ − 673
regular expressions}. This notion allows one to do almost all
+ − 674
calculations with regular expressions on the level of regular
+ − 675
expressions, not needing any automata (you will see we only touch
+ − 676
briefly on automata in lecture 3). Automata are usually the main
+ − 677
object of study in formal language courses. The avoidance of automata
924
+ − 678
is crucial for me because automata are graphs and it is rather
+ − 679
difficult to reason about graphs in theorem provers. In contrast,
+ − 680
reasoning about regular expressions is easy-peasy in theorem
+ − 681
provers. Is this important? I think yes, because according to
+ − 682
Kuklewicz nearly all POSIX-based regular expression matchers are
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 683
buggy.\footnote{\url{http://www.haskell.org/haskellwiki/Regex_Posix}}
924
+ − 684
With my PhD students Fahad Ausaf and Chengsong Tan, I proved the
+ − 685
correctness for two such matchers that were proposed by Sulzmann and Lu
+ − 686
in 2014.\footnote{\url{http://goo.gl/bz0eHp}} A variant of which you have
+ − 687
already seen in PEP as CW3 and you will see again in the CFL in the first
+ − 688
two CWs. What we have not yet figured out that our matchers are
+ − 689
universally fast, meaning they do not explode on any input.
+ − 690
Hopefully we can also prove
+ − 691
that the machine code(!) that implements our matchers efficiently is
+ − 692
correct also. Writing programs in this way does not leave any room for
+ − 693
any errors or bugs. How nice!
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 694
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 695
What also helped with my fascination with regular expressions
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 696
is that we could indeed find out new things about them that
416
+ − 697
have surprised some experts. Together with two colleagues from China, I was
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 698
able to prove the Myhill-Nerode theorem by only using regular
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 699
expressions and the notion of derivatives. Earlier versions of
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 700
this theorem used always automata in the proof. Using this
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 701
theorem we can show that regular languages are closed under
830
+ − 702
complementation, something which Bill Gasarch in his Computational Complexity
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 703
blog\footnote{\url{http://goo.gl/2R11Fw}} assumed can only be
550
+ − 704
shown via automata. So even somebody who has written a 700+-page
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 705
book\footnote{\url{http://goo.gl/fD0eHx}} on regular
550
+ − 706
expressions did not know better. Well, we showed it can also be
924
+ − 707
done with regular expressions only.\footnote{\url{https://nms.kcl.ac.uk/christian.urban/Publications/rexp.pdf}}
471
+ − 708
What a feeling when you are an outsider to the subject!
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 709
471
+ − 710
To conclude: Despite my early ignorance about regular expressions, I
550
+ − 711
find them now extremely interesting. They have practical importance
471
+ − 712
(remember the shocking runtime of the regular expression matchers in
924
+ − 713
Python, Ruby, Swift and Java in some instances and the problems in Stack
+ − 714
Exchange and the Atom editor---even regex libraries in more modern programming languages, like Rust, have their problems). They are used in tools like Snort and
872
+ − 715
Zeek in order to monitor network traffic. They have a beautiful mathematical
550
+ − 716
theory behind them, which can be sometimes quite deep and which
+ − 717
sometimes contains hidden snares. People who are not very familiar
+ − 718
with the mathematical background of regular expressions get them
492
+ − 719
consistently wrong (this is surprising given they are a supposed to be
872
+ − 720
a core skill for computer scientists). The hope is that we can do better
492
+ − 721
in the future---for example by proving that the algorithms actually
471
+ − 722
satisfy their specification and that the corresponding implementations
+ − 723
do not contain any bugs. We are close, but not yet quite there.
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 724
332
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 725
Notwithstanding my fascination, I am also happy to admit that regular
244
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 726
expressions have their shortcomings. There are some well-known
471
+ − 727
``theoretical'' shortcomings, for example recognising strings of the
+ − 728
form $a^{n}b^{n}$ is not possible with regular expressions. This means
550
+ − 729
for example if we try to recognise whether parentheses are well-nested
492
+ − 730
in an expression is impossible with (basic) regular expressions. I am
+ − 731
not so bothered by these shortcomings. What I am bothered about is
+ − 732
when regular expressions are in the way of practical programming. For
+ − 733
example, it turns out that the regular expression for email addresses
+ − 734
shown in \eqref{email} is hopelessly inadequate for recognising all of
+ − 735
them (despite being touted as something every computer scientist
+ − 736
should know about). The W3 Consortium (which standardises the Web)
+ − 737
proposes to use the following, already more complicated regular
+ − 738
expressions for email addresses:
244
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 739
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 740
{\small\begin{lstlisting}[language={},keywordstyle=\color{black},numbers=none]
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 741
[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 742
\end{lstlisting}}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 743
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 744
\noindent But they admit that by using this regular expression
471
+ − 745
they wilfully violate the RFC 5322 standard, which specifies
244
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 746
the syntax of email addresses. With their proposed regular
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 747
expression they are too strict in some cases and too lax in
872
+ − 748
others\ldots{}not a good situation to be in. A regular expression
244
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 749
that is claimed to be closer to the standard is shown in
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 750
Figure~\ref{monster}. Whether this claim is true or not, I
416
+ − 751
would not know---the only thing I can say about this regular
248
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 752
expression is it is a monstrosity. However, this might
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 753
actually be an argument against the RFC standard, rather than
399
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 754
against regular expressions. A similar argument is made in
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 755
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 756
\begin{center}
570
+ − 757
\url{http://elliot.land/post/its-impossible-to-validate-an-email-address}
399
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 758
\end{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 759
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 760
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 761
\noindent which explains some of the crazier parts of email
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 762
addresses. Still it is good to know that some tasks in text
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 763
processing just cannot be achieved by using regular
471
+ − 764
expressions. But for what we want to use them (lexing) they are
473
+ − 765
pretty good.\medskip
+ − 766
+ − 767
\noindent
+ − 768
Finally there is a joke about regular expressions:
+ − 769
+ − 770
\begin{quote}\it
+ − 771
``Sometimes you have a programming problem and it seems like the
+ − 772
best solution is to use regular expressions; now you have two
+ − 773
problems.''
+ − 774
\end{quote}
244
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 775
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 776
924
+ − 777
%\begin{figure}[p]\small
+ − 778
% \lstinputlisting[numbers=left,linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}]
+ − 779
% {../progs/crawler1.scala}
+ − 780
%
+ − 781
%\caption{The Scala code for a simple web-crawler that checks
+ − 782
%for broken links in web-pages. It uses the regular expression
+ − 783
%\texttt{http\_pattern} in Line~\ref{httpline} for recognising
+ − 784
%URL-addresses. It finds all links using the library function
+ − 785
%\texttt{findAllIn} in Line~\ref{findallline} (this function
+ − 786
%is part of Scala's regular expression library).\label{crawler1}}
+ − 787
%
+ − 788
%\end{figure}
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 789
722
+ − 790
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 791
722
+ − 792
%\begin{figure}[p]\small
+ − 793
% \lstinputlisting[numbers=left,linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}]
+ − 794
% {../progs/crawler2.scala}
+ − 795
%
+ − 796
%\caption{A version of the web-crawler that only follows links
+ − 797
%in ``my'' domain---since these are the ones I am interested in
+ − 798
%to fix. It uses the regular expression \texttt{my\_urls} in
+ − 799
%Line~\ref{myurlline} to check for my name in the links. The
+ − 800
%main change is in
+ − 801
%Lines~\ref{changestartline}--\ref{changeendline} where there
+ − 802
%is a test whether URL is in ``my'' domain or
+ − 803
%not.\label{crawler2}}
+ − 804
%\end{figure}
477
+ − 805
924
+ − 806
%\begin{figure}[p]\small
+ − 807
% \lstinputlisting[numbers=left,linebackgroundcolor={\ifodd\value{lstnumber}\co%lor{capri!3}\fi}]
+ − 808
% {../progs/crawler2.scala}
+ − 809
%
+ − 810
%\caption{A small email harvester---whenever we download a
+ − 811
%web-page, we also check whether it contains any email
+ − 812
%addresses. For this we use the regular expression
+ − 813
%\texttt{email\_pattern} in Line~\ref{emailline}. The main
+ − 814
%change is in Line~\ref{mainline} where all email addresses
+ − 815
%that can be found in a page are printed.\label{crawler3}}
+ − 816
%
+ − 817
%\end{figure}
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 818
244
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 819
\begin{figure}[p]
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 820
\tiny
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 821
\begin{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 822
\begin{minipage}{0.8\textwidth}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 823
\lstinputlisting[language={},keywordstyle=\color{black},numbers=none]{../progs/email-rexp}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 824
\end{minipage}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 825
\end{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 826
404
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 827
\caption{Nothing that can be said about this regular
416
+ − 828
expression\ldots{}except it is a monstrosity.\label{monster}}
244
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 829
\end{figure}
108
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 830
112
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 831
105
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
+ − 832
\end{document}
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
+ − 833
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
+ − 834
%%% Local Variables:
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
+ − 835
%%% mode: latex
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
+ − 836
%%% TeX-master: t
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
+ − 837
%%% End: