621
|
1 |
% !TEX program = xelatex
|
105
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
2 |
\documentclass{article}
|
237
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
3 |
\usepackage{../style}
|
217
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
4 |
\usepackage{../langs}
|
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
5 |
\usepackage{../graphics}
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
6 |
\usepackage{../data}
|
477
|
7 |
\usepackage{lstlinebgrd}
|
|
8 |
\definecolor{capri}{rgb}{0.0, 0.75, 1.0}
|
108
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
9 |
|
306
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
10 |
%%http://regexcrossword.com/challenges/cities/puzzles/1
|
398
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
11 |
%%https://jex.im/regulex/
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
12 |
%%https://www.reddit.com/r/ProgrammingLanguages/comments/42dlem/mona_compiler_development_part_2_parsing/
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
13 |
%%https://www.reddit.com/r/ProgrammingLanguages/comments/43wlkq/formal_grammar_for_csh_tsch_sh_or_bash/
|
112
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
14 |
|
403
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
15 |
%% regex displayers
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
16 |
%% https://regexper.com/#a%7Ca
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
17 |
%% https://www.debuggex.com
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
18 |
%% https://jex.im/regulex/#!embed=false&flags=&re=%5E(a%7Cb)*%3F%24
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
19 |
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
20 |
%% email validator
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
21 |
%% http://www.ex-parrot.com/%7Epdw/Mail-RFC822-Address.html
|
496
|
22 |
% https://jackfoxy.github.io/FsRegEx/emailregex.html
|
|
23 |
|
403
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
24 |
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
25 |
%% regex testers
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
26 |
% https://regex101.com
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
27 |
% http://regexr.com
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
28 |
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
29 |
%% emacs regexes
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
30 |
%% https://www.gnu.org/software/emacs/manual/html_node/elisp/Regular-Expressions.html
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
31 |
|
551
|
32 |
%% reasons for a new programming language
|
403
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
33 |
%% http://beautifulracket.com
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
34 |
|
418
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
35 |
% compiler explorer
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
36 |
% https://gcc.godbolt.org
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
37 |
|
622
|
38 |
|
716
|
39 |
% good article how languages have been influenced
|
|
40 |
% 10 MOST(LY DEAD) INFLUENTIAL PROGRAMMING LANGUAGES
|
|
41 |
% https://www.hillelwayne.com/post/influential-dead-languages/
|
|
42 |
|
622
|
43 |
|
105
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
44 |
\begin{document}
|
723
|
45 |
\fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017, 2018, 2019, 2020}
|
105
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
46 |
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
47 |
\section*{Handout 1}
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
48 |
|
742
|
49 |
|
722
|
50 |
The purpose of a compiler is to transform a program a human can read and
|
|
51 |
write into code the machine can run as fast as possible. Developing a
|
|
52 |
compiler is an old craft going back to 1952 with the first compiler
|
|
53 |
written by Grace Hopper.\footnote{Who many years ago was invited on a
|
743
|
54 |
talk show hosted by David Letterman.
|
745
|
55 |
\here{https://youtu.be/3N_ywhx6_K0?t=31}} Why studying compilers
|
722
|
56 |
nowadays? An interesting answer is given by John Regher in his compiler
|
743
|
57 |
blog:\here{http://blog.regehr.org/archives/1419}
|
690
|
58 |
|
|
59 |
\begin{quote}\it{}``We can start off with a couple of observations
|
|
60 |
about the role of compilers. First, hardware is getting weirder
|
|
61 |
rather than getting clocked faster: almost all processors are
|
|
62 |
multicores and it looks like there is increasing asymmetry in
|
|
63 |
resources across cores. Processors come with vector units, crypto
|
|
64 |
accelerators, bit twiddling instructions, and lots of features to
|
|
65 |
make virtualization and concurrency work. We have DSPs, GPUs,
|
|
66 |
big.little, and Xeon Phi. This is only scratching the
|
|
67 |
surface. Second, we’re getting tired of low-level languages and
|
|
68 |
their associated security disasters, we want to write new code, to
|
|
69 |
whatever extent possible, in safer, higher-level
|
|
70 |
languages. Compilers are caught right in the middle of these
|
|
71 |
opposing trends: one of their main jobs is to help bridge the large
|
|
72 |
and growing gap between increasingly high-level languages and
|
|
73 |
increasingly wacky platforms. It’s effectively a perpetual
|
|
74 |
employment act for solid compiler hackers.''
|
|
75 |
\end{quote}
|
|
76 |
|
722
|
77 |
\noindent
|
|
78 |
So the goal of this module is to become a solid (beginner) compiler
|
|
79 |
hacker and as part of the coursework to implement a small
|
|
80 |
compiler for a very small programming language.
|
690
|
81 |
|
722
|
82 |
The first part of the module is about the problem of text processing,
|
|
83 |
which is needed for compilers, but also for dictionaries, DNA-data,
|
|
84 |
spam-filters and so on. When looking for a particular string, say
|
|
85 |
\pcode{"foobar"}, in a large text we can use the Knuth-Morris-Pratt
|
|
86 |
algorithm, which is currently the most efficient general string search
|
|
87 |
algorithm. But often we do \emph{not} just look for a particular string,
|
|
88 |
but for string patterns. For example, in program code we need to
|
|
89 |
identify what are the keywords (\texttt{if}, \texttt{then},
|
|
90 |
\texttt{while}, \texttt{for}, etc) and what are the identifiers
|
|
91 |
(variable names). A pattern for identifiers could be stated as: they
|
|
92 |
start with a letter, followed by zero or more letters, numbers and
|
|
93 |
underscores.
|
618
|
94 |
|
621
|
95 |
%You might also be surprised what
|
|
96 |
%constraints programming languages impose about numbers: for example
|
|
97 |
%123 in JSON is OK, but 0123 is not.
|
706
|
98 |
%
|
|
99 |
% The regex for JASON numbers is
|
|
100 |
%
|
|
101 |
% -?(0|[1-9][0-9]*)(\.[0-9]+)?([eE][+-]?[0-9]+)?
|
621
|
102 |
|
|
103 |
Often we also face the problem that we are given a string, for example
|
|
104 |
some user input, and we want to know whether it matches a particular
|
622
|
105 |
pattern---is it an email address, for example. In this way we can
|
618
|
106 |
exclude user input that would otherwise have nasty effects on our
|
|
107 |
program (crashing it or making it go into an infinite loop, if not
|
622
|
108 |
worse). This kind of ``inspecting'' mechanism is also implemented in
|
|
109 |
popular network security tools such as Snort and
|
743
|
110 |
Bro.\here{www.snort.org}\here{www.bro.org} They scan incoming
|
622
|
111 |
network traffic for computer viruses or malicious packets. Similarly
|
|
112 |
filtering out spam usually involves looking for some signature
|
621
|
113 |
(essentially a string pattern). The point is that the fast
|
618
|
114 |
Knuth-Morris-Pratt algorithm for strings is not good enough for such
|
|
115 |
string \emph{patterns}.\smallskip
|
291
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
116 |
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
117 |
\defn{Regular expressions} help with conveniently specifying
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
118 |
such patterns. The idea behind regular expressions is that
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
119 |
they are a simple method for describing languages (or sets of
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
120 |
strings)\ldots at least languages we are interested in in
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
121 |
computer science. For example there is no convenient regular
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
122 |
expression for describing the English language short of
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
123 |
enumerating all English words. But they seem useful for
|
399
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
124 |
describing for example simple email addresses.\footnote{See
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
125 |
``8 Regular Expressions You Should Know''
|
291
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
126 |
\url{http://goo.gl/5LoVX7}} Consider the following regular
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
127 |
expression
|
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
128 |
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
129 |
\begin{equation}\label{email}
|
416
|
130 |
\texttt{[a-z0-9\_.-]+} \;\;\texttt{@}\;\; \texttt{[a-z0-9.-]+} \;\;\texttt{.}\;\; \texttt{[a-z.]\{2,6\}}
|
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
131 |
\end{equation}
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
132 |
|
621
|
133 |
\noindent where the first part, the user name, matches one or more
|
|
134 |
lowercase letters (\pcode{a-z}), digits (\pcode{0-9}), underscores, dots
|
|
135 |
and hyphens. The \pcode{+} at the end of the brackets ensures the ``one
|
|
136 |
or more''. Then comes the email \pcode{@}-sign, followed by the domain
|
|
137 |
name which must be one or more lowercase letters, digits, underscores,
|
|
138 |
dots or hyphens (but no underscores). Finally there must be a dot
|
|
139 |
followed by the toplevel domain. This toplevel domain must be 2 to 6
|
|
140 |
lowercase letters including the dot. Example strings which follow this
|
|
141 |
pattern are:
|
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
142 |
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
143 |
\begin{lstlisting}[language={},numbers=none,keywordstyle=\color{black}]
|
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
144 |
niceandsimple@example.org
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
145 |
very.common@example.co.uk
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
146 |
a.little.lengthy.but.fine@dept.example.ac.uk
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
147 |
other.email-with-dash@example.edu
|
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
148 |
\end{lstlisting}
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
149 |
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
150 |
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
151 |
\noindent
|
399
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
152 |
But for example the following two do not
|
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
153 |
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
154 |
\begin{lstlisting}[language={},numbers=none,keywordstyle=\color{black}]
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
155 |
user@localserver
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
156 |
disposable.style.email.with+symbol@example.com
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
157 |
\end{lstlisting}
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
158 |
|
471
|
159 |
\noindent according to the regular expression we specified in line
|
|
160 |
\eqref{email} above. Whether this is intended or not is a different
|
|
161 |
question (the second email above is actually an acceptable email
|
550
|
162 |
address according to the RFC 5322 standard for email addresses).
|
399
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
163 |
|
327
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
164 |
As mentioned above, identifiers, or variables, in program code
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
165 |
are often required to satisfy the constraints that they start
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
166 |
with a letter and then can be followed by zero or more letters
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
167 |
or numbers and also can include underscores, but not as the
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
168 |
first character. Such identifiers can be recognised with the
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
169 |
regular expression
|
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
170 |
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
171 |
\begin{center}
|
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
172 |
\pcode{[a-zA-Z] [a-zA-Z0-9_]*}
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
173 |
\end{center}
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
174 |
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
175 |
\noindent Possible identifiers that match this regular expression
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
176 |
are \pcode{x}, \pcode{foo}, \pcode{foo_bar_1}, \pcode{A_very_42_long_object_name},
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
177 |
but not \pcode{_i} and also not \pcode{4you}.
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
178 |
|
404
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
179 |
Many programming languages offer libraries that can be used to
|
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
180 |
validate such strings against regular expressions. Also there
|
248
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
181 |
are some common, and I am sure very familiar, ways of how to
|
404
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
182 |
construct regular expressions. For example in Scala we have
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
183 |
a library implementing the following regular expressions:
|
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
184 |
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
185 |
\begin{center}
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
186 |
\begin{tabular}{lp{9cm}}
|
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
187 |
\pcode{re*} & matches 0 or more occurrences of preceding
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
188 |
expression\\
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
189 |
\pcode{re+} & matches 1 or more occurrences of preceding
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
190 |
expression\\
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
191 |
\pcode{re?} & matches 0 or 1 occurrence of preceding
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
192 |
expression\\
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
193 |
\pcode{re\{n\}} & matches exactly \pcode{n} number of
|
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
194 |
occurrences of preceding expression\\
|
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
195 |
\pcode{re\{n,m\}} & matches at least \pcode{n} and at most {\tt m}
|
550
|
196 |
occurrences of the preceding expression\\
|
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
197 |
\pcode{[...]} & matches any single character inside the
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
198 |
brackets\\
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
199 |
\pcode{[^...]} & matches any single character not inside the
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
200 |
brackets\\
|
250
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
201 |
\pcode{...-...} & character ranges\\
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
202 |
\pcode{\\d} & matches digits; equivalent to \pcode{[0-9]}\\
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
203 |
\pcode{.} & matches every character except newline\\
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
204 |
\pcode{(re)} & groups regular expressions and remembers
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
205 |
matched text
|
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
206 |
\end{tabular}
|
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
207 |
\end{center}
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
208 |
|
722
|
209 |
\noindent With this table you can figure out the purpose of the regular
|
|
210 |
expressions in the web-crawlers shown Figures \ref{crawler1} and
|
723
|
211 |
\ref{crawler3}. In Figure~\ref{crawler1}, however, be careful with
|
722
|
212 |
the regular expression for http-addresses in Line~\ref{httpline}. It is
|
|
213 |
intended to be
|
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
214 |
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
215 |
\[
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
216 |
\pcode{"https?://[^"]*"}
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
217 |
\]
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
218 |
|
722
|
219 |
\noindent specifying that http-addresses need to start with a double
|
|
220 |
quote, then comes \texttt{http} followed by an optional \texttt{s} and
|
|
221 |
so on\ldots{}until the closing double quote comes at the end of the
|
|
222 |
address. Normally we would have to escape the double quotes in order to
|
|
223 |
make sure we interpret the double quote as character, not as double
|
|
224 |
quote for a string. But Scala's trick with triple quotes allows us to
|
|
225 |
omit this kind of ugly escaping. As a result we can just write:
|
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
226 |
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
227 |
\[
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
228 |
\pcode{""""https?://[^"]*"""".r}
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
229 |
\]
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
230 |
|
621
|
231 |
\noindent The convention in Scala is that \texttt{.r} converts a string
|
|
232 |
into a regular expression. I leave it to you to ponder whether this
|
|
233 |
regular expression really captures all possible web-addresses. If you
|
|
234 |
need a quick recap about regular expressions and how the match strings,
|
|
235 |
here is a quick video: \url{https://youtu.be/bgBWp9EIlMM}.
|
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
236 |
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
237 |
\subsection*{Why Study Regular Expressions?}
|
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
238 |
|
471
|
239 |
Regular expressions were introduced by Kleene in the 1950ies and they
|
|
240 |
have been object of intense study since then. They are nowadays pretty
|
|
241 |
much ubiquitous in computer science. There are many libraries
|
|
242 |
implementing regular expressions. I am sure you have come across them
|
622
|
243 |
before (remember the PRA or PEP modules?).
|
621
|
244 |
|
|
245 |
Why on earth then is there any interest in studying them again in depth
|
|
246 |
in this module? Well, one answer is in the following two graphs about
|
|
247 |
regular expression matching in Python, Ruby, JavaScript and Java
|
|
248 |
(Version 8).
|
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
249 |
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
250 |
\begin{center}
|
621
|
251 |
\begin{tabular}{@{\hspace{-1mm}}c@{\hspace{1.5mm}}c@{}}
|
477
|
252 |
\begin{tikzpicture}
|
|
253 |
\begin{axis}[
|
|
254 |
title={Graph: $\texttt{(a*)*\,b}$ and strings
|
|
255 |
$\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
|
|
256 |
xlabel={$n$},
|
|
257 |
x label style={at={(1.05,0.0)}},
|
|
258 |
ylabel={time in secs},
|
|
259 |
enlargelimits=false,
|
|
260 |
xtick={0,5,...,30},
|
|
261 |
xmax=33,
|
|
262 |
ymax=35,
|
|
263 |
ytick={0,5,...,30},
|
|
264 |
scaled ticks=false,
|
|
265 |
axis lines=left,
|
|
266 |
width=5.5cm,
|
|
267 |
height=4.5cm,
|
618
|
268 |
legend entries={Python, Java 8, JavaScript},
|
477
|
269 |
legend pos=north west,
|
|
270 |
legend cell align=left]
|
|
271 |
\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
|
|
272 |
\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
|
618
|
273 |
\addplot[red,mark=*, mark options={fill=white}] table {re-js.data};
|
477
|
274 |
\end{axis}
|
|
275 |
\end{tikzpicture}
|
|
276 |
&
|
291
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
277 |
\begin{tikzpicture}
|
399
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
278 |
\begin{axis}[
|
415
|
279 |
title={Graph: $\texttt{a?\{n\}\,a{\{n\}}}$ and strings
|
|
280 |
$\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
|
|
281 |
xlabel={$n$},
|
|
282 |
x label style={at={(1.05,0.0)}},
|
399
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
283 |
ylabel={time in secs},
|
291
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
284 |
enlargelimits=false,
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
285 |
xtick={0,5,...,30},
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
286 |
xmax=33,
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
287 |
ymax=35,
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
288 |
ytick={0,5,...,30},
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
289 |
scaled ticks=false,
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
290 |
axis lines=left,
|
415
|
291 |
width=5.5cm,
|
318
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
292 |
height=4.5cm,
|
291
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
293 |
legend entries={Python,Ruby},
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
294 |
legend pos=north west,
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
295 |
legend cell align=left]
|
448
|
296 |
\addplot[blue,mark=*, mark options={fill=white}] table {re-python.data};
|
|
297 |
\addplot[brown,mark=triangle*, mark options={fill=white}] table {re-ruby.data};
|
291
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
298 |
\end{axis}
|
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
299 |
\end{tikzpicture}
|
415
|
300 |
\end{tabular}
|
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
301 |
\end{center}
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
302 |
|
618
|
303 |
\noindent This first graph shows that Python, JavaScript and Java 8 need
|
477
|
304 |
approximately 30 seconds to find out that the regular expression
|
621
|
305 |
$\texttt{(a*)*\,b}$ does not match strings of 28 \texttt{a}s. Similarly,
|
|
306 |
the second shows that Python and Ruby need approximately 29 seconds for finding
|
|
307 |
out whether a string of 28 \texttt{a}s matches the regular expression
|
|
308 |
\texttt{a?\{28\}\,a\{28\}}.\footnote{In this
|
|
309 |
example Ruby uses actually the slightly different regular expression
|
|
310 |
\texttt{a?a?a?...a?a?aaa...aa}, where the \texttt{a?} and \texttt{a}
|
|
311 |
each occur $n$ times. More such test cases can be found at
|
|
312 |
\url{https://www.owasp.org/index.php/Regular_expression_Denial_of_Service_-_ReDoS}.}
|
477
|
313 |
Admittedly, these regular expressions are carefully chosen to exhibit
|
|
314 |
this exponential behaviour, but similar ones occur more often than one
|
|
315 |
wants in ``real life''. For example, on 20 July 2016 a similar regular
|
|
316 |
expression brought the webpage \href{http://stackexchange.com}{Stack
|
621
|
317 |
Exchange} to its knees:
|
407
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
318 |
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
319 |
\begin{center}
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
320 |
\url{http://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
321 |
\end{center}
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
322 |
|
417
|
323 |
\noindent I can also highly recommend a cool webtalk from an engineer
|
|
324 |
from Stack Exchange on the same subject:
|
|
325 |
|
|
326 |
\begin{center}
|
|
327 |
\url{https://vimeo.com/112065252}
|
|
328 |
\end{center}
|
|
329 |
|
407
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
330 |
\noindent
|
550
|
331 |
A similar problem also occurred in the Atom editor:
|
416
|
332 |
|
|
333 |
\begin{center}
|
|
334 |
\url{http://davidvgalbraith.com/how-i-fixed-atom/}
|
|
335 |
\end{center}
|
|
336 |
|
563
|
337 |
\noindent
|
621
|
338 |
and also when somebody tried to match web-addresses using a
|
|
339 |
relatively simple regular expression
|
554
|
340 |
|
|
341 |
\begin{center}
|
|
342 |
\url{https://www.tutorialdocs.com/article/regex-trap.html}
|
|
343 |
\end{center}
|
|
344 |
|
621
|
345 |
\noindent
|
|
346 |
Finally, on 2 July 2019 Cloudflare had a global outage because of a
|
|
347 |
regular expression (they had no outage for the last 6 years). What
|
|
348 |
happened is nicely explained in the blog:
|
|
349 |
|
|
350 |
\begin{center}
|
|
351 |
\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019}
|
|
352 |
\end{center}
|
563
|
353 |
|
415
|
354 |
Such troublesome regular expressions are sometimes called \emph{evil
|
722
|
355 |
regular expressions} because they have the potential to make regular
|
|
356 |
expression matching engines to topple over, like in Python, Ruby,
|
|
357 |
JavaScript and Java 8. This ``toppling over'' is also sometimes called
|
|
358 |
\emph{catastrophic backtracking}. I have also seen the term
|
|
359 |
\emph{eternal matching} used for this. The problem with evil regular
|
|
360 |
expressions and catastrophic backtracking is that they can have some
|
|
361 |
serious consequences, for example, if you use them in your
|
|
362 |
web-application. The reason is that hackers can look for these instances
|
|
363 |
where the matching engine behaves badly and then mount a nice DoS-attack
|
|
364 |
against your application. These attacks are already have their own name:
|
|
365 |
\emph{Regular Expression Denial of Service Attacks (ReDoS)}.
|
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
366 |
|
622
|
367 |
It will be instructive to look behind the ``scenes'' to find out why
|
|
368 |
Python and Ruby (and others) behave so badly when matching strings with
|
|
369 |
evil regular expressions. But we will also look at a relatively simple
|
|
370 |
algorithm that solves this problem much better than Python and Ruby
|
|
371 |
do\ldots actually it will be two versions of the algorithm: the first
|
|
372 |
one will be able in the example \texttt{a?\{n\}\,a\{n\}} to process strings of
|
|
373 |
approximately 1,100 \texttt{a}s in 23 seconds, while the second version
|
|
374 |
will even be able to process up to 11,000(!) in 5 seconds, see the graph
|
|
375 |
below:
|
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
376 |
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
377 |
\begin{center}
|
291
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
378 |
\begin{tikzpicture}
|
399
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
379 |
\begin{axis}[
|
415
|
380 |
title={Graph: $\texttt{a?\{n\}\,a{\{n\}}}$ and strings
|
|
381 |
$\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
|
|
382 |
xlabel={$n$},
|
|
383 |
x label style={at={(1.05,0.0)}},
|
399
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
384 |
ylabel={time in secs},
|
291
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
385 |
enlargelimits=false,
|
477
|
386 |
xtick={0,3000,...,12000},
|
|
387 |
xmax=13000,
|
443
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
388 |
ymax=32,
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
389 |
ytick={0,5,...,30},
|
291
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
390 |
scaled ticks=false,
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
391 |
axis lines=left,
|
415
|
392 |
width=7cm,
|
622
|
393 |
height=4.4cm,
|
415
|
394 |
legend entries={Our Algorithm V1, Our Algorithm V2},
|
|
395 |
legend pos=outer north east]
|
|
396 |
\addplot[green,mark=square*,mark options={fill=white}] table {re2.data};
|
291
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
397 |
\addplot[black,mark=square*,mark options={fill=white}] table {re3.data};
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
398 |
\end{axis}
|
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
399 |
\end{tikzpicture}
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
400 |
\end{center}
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
401 |
|
415
|
402 |
\noindent And in the case of the regular expression $\texttt{(a*)*\,b}$
|
563
|
403 |
and strings of \texttt{a}s we will beat Java 8 by factor of
|
415
|
404 |
approximately 1,000,000 (note the scale on the $x$-axis).
|
|
405 |
|
|
406 |
\begin{center}
|
|
407 |
\begin{tikzpicture}
|
|
408 |
\begin{axis}[
|
|
409 |
title={Graph: $\texttt{(a*)*\,b}$ and strings
|
|
410 |
$\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
|
|
411 |
xlabel={$n$},
|
|
412 |
x label style={at={(1.05,0.0)}},
|
|
413 |
ylabel={time in secs},
|
|
414 |
enlargelimits=false,
|
443
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
415 |
ymax=32,
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
416 |
ytick={0,5,...,30},
|
415
|
417 |
axis lines=left,
|
|
418 |
width=7cm,
|
|
419 |
height=4.5cm,
|
|
420 |
legend entries={Our Algorithm V2},
|
|
421 |
legend pos=outer north east]
|
|
422 |
\addplot[black,mark=square*,mark options={fill=white}] table {re3a.data};
|
|
423 |
\end{axis}
|
|
424 |
\end{tikzpicture}
|
|
425 |
\end{center}
|
|
426 |
|
722
|
427 |
\noindent
|
|
428 |
You might have wondered above why I looked at the (now) old Java 8: the
|
|
429 |
reason is that Java 9 and later versions are a bit better, but we will
|
|
430 |
still beat them hands down with our regex matcher.
|
|
431 |
|
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
432 |
\subsection*{Basic Regular Expressions}
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
433 |
|
399
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
434 |
The regular expressions shown earlier for Scala, we
|
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
435 |
will call \emph{extended regular expressions}. The ones we
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
436 |
will mainly study in this module are \emph{basic regular
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
437 |
expressions}, which by convention we will just call
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
438 |
\emph{regular expressions}, if it is clear what we mean. The
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
439 |
attraction of (basic) regular expressions is that many
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
440 |
features of the extended ones are just syntactic sugar.
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
441 |
(Basic) regular expressions are defined by the following
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
442 |
grammar:
|
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
443 |
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
444 |
\begin{center}
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
445 |
\begin{tabular}{r@{\hspace{1mm}}r@{\hspace{1mm}}l@{\hspace{13mm}}l}
|
399
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
446 |
$r$ & $::=$ & $\ZERO$ & null language\\
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
447 |
& $\mid$ & $\ONE$ & empty string / \texttt{""} / []\\
|
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
448 |
& $\mid$ & $c$ & single character\\
|
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
449 |
& $\mid$ & $r_1 + r_2$ & alternative / choice\\
|
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
450 |
& $\mid$ & $r_1 \cdot r_2$ & sequence\\
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
451 |
& $\mid$ & $r^*$ & star (zero or more)\\
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
452 |
\end{tabular}
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
453 |
\end{center}
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
454 |
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
455 |
\noindent Because we overload our notation, there are some
|
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
456 |
subtleties you should be aware of. When regular expressions
|
404
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
457 |
are referred to, then $\ZERO$ (in bold font) does not stand for
|
399
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
458 |
the number zero: rather it is a particular pattern that does
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
459 |
not match any string. Similarly, in the context of regular
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
460 |
expressions, $\ONE$ does not stand for the number one but for
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
461 |
a regular expression that matches the empty string. The letter
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
462 |
$c$ stands for any character from the alphabet at hand. Again
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
463 |
in the context of regular expressions, it is a particular
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
464 |
pattern that can match the specified character. You should
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
465 |
also be careful with our overloading of the star: assuming you
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
466 |
have read the handout about our basic mathematical notation,
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
467 |
you will see that in the context of languages (sets of
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
468 |
strings) the star stands for an operation on languages. Here
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
469 |
$r^*$ stands for a regular expression, which is different from
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
470 |
the operation on sets is defined as
|
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
471 |
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
472 |
\[
|
399
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
473 |
A\star\dn \bigcup_{0\le n} A^n
|
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
474 |
\]
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
475 |
|
334
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
476 |
|
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
477 |
We will use parentheses to disambiguate regular expressions.
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
478 |
Parentheses are not really part of a regular expression, and
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
479 |
indeed we do not need them in our code because there the tree
|
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
480 |
structure of regular expressions is always clear. But for
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
481 |
writing them down in a more mathematical fashion, parentheses
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
482 |
will be helpful. For example we will write $(r_1 + r_2)^*$,
|
742
|
483 |
which is different from, say $r_1 + (r_2)^*$. This can be
|
|
484 |
seen if we write regular expressions as trees:
|
|
485 |
|
|
486 |
\begin{center}
|
|
487 |
\includegraphics[scale=0.65]{../pics/r1.pdf}
|
|
488 |
\hspace{3cm}
|
|
489 |
\includegraphics[scale=0.65]{../pics/r2.pdf}
|
|
490 |
\end{center}
|
|
491 |
|
|
492 |
\noindent
|
|
493 |
The regular expression on the left means
|
|
494 |
roughly zero or more times $r_1$ or $r_2$, while the one on the right
|
550
|
495 |
means $r_1$, or zero or more times $r_2$. This will turn out to
|
248
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
496 |
be two different patterns, which match in general different
|
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
497 |
strings. We should also write $(r_1 + r_2) + r_3$, which is
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
498 |
different from the regular expression $r_1 + (r_2 + r_3)$, but
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
499 |
in case of $+$ and $\cdot$ we actually do not care about the
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
500 |
order and just write $r_1 + r_2 + r_3$, or $r_1 \cdot r_2
|
550
|
501 |
\cdot r_3$, respectively. The reasons for this carelessness will become
|
399
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
502 |
clear shortly.
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
503 |
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
504 |
In the literature you will often find that the choice $r_1 +
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
505 |
r_2$ is written as $r_1\mid{}r_2$ or $r_1\mid\mid{}r_2$. Also,
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
506 |
often our $\ZERO$ and $\ONE$ are written $\varnothing$ and
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
507 |
$\epsilon$, respectively. Following the convention in the
|
550
|
508 |
literature, we will often omit the $\cdot$. This
|
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
509 |
is to make some concrete regular expressions more readable.
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
510 |
For example the regular expression for email addresses shown
|
550
|
511 |
in \eqref{email} would fully expanded look like
|
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
512 |
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
513 |
\[
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
514 |
\texttt{[...]+} \;\cdot\; \texttt{@} \;\cdot\;
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
515 |
\texttt{[...]+} \;\cdot\; \texttt{.} \;\cdot\;
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
516 |
\texttt{[...]\{2,6\}}
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
517 |
\]
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
518 |
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
519 |
\noindent
|
471
|
520 |
which is much less readable than the regular expression in
|
|
521 |
\eqref{email}. Similarly for the regular expression that matches the
|
|
522 |
string $hello$ we should write
|
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
523 |
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
524 |
\[
|
248
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
525 |
h \cdot e \cdot l \cdot l \cdot o
|
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
526 |
\]
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
527 |
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
528 |
\noindent
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
529 |
but often just write {\it hello}.
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
530 |
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
531 |
If you prefer to think in terms of the implementation
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
532 |
of regular expressions in Scala, the constructors and
|
245
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
533 |
classes relate as follows\footnote{More about Scala is
|
404
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
534 |
in the handout about \emph{A Crash-Course on Scala}.}
|
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
535 |
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
536 |
\begin{center}
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
537 |
\begin{tabular}{rcl}
|
399
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
538 |
$\ZERO$ & $\mapsto$ & \texttt{ZERO}\\
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
539 |
$\ONE$ & $\mapsto$ & \texttt{ONE}\\
|
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
540 |
$c$ & $\mapsto$ & \texttt{CHAR(c)}\\
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
541 |
$r_1 + r_2$ & $\mapsto$ & \texttt{ALT(r1, r2)}\\
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
542 |
$r_1 \cdot r_2$ & $\mapsto$ & \texttt{SEQ(r1, r2)}\\
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
543 |
$r^*$ & $\mapsto$ & \texttt{STAR(r)}
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
544 |
\end{tabular}
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
545 |
\end{center}
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
546 |
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
547 |
A source of confusion might arise from the fact that we
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
548 |
use the term \emph{basic regular expression} for the regular
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
549 |
expressions used in ``theory'' and defined above, and
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
550 |
\emph{extended regular expression} for the ones used in
|
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
551 |
``practice'', for example in Scala. If runtime is not an
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
552 |
issue, then the latter can be seen as syntactic sugar of
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
553 |
the former. For example we could replace
|
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
554 |
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
555 |
\begin{center}
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
556 |
\begin{tabular}{rcl}
|
471
|
557 |
$r^+$ & $\mapsto$ & $r\cdot r^*$\\
|
|
558 |
$r^?$ & $\mapsto$ & $\ONE + r$\\
|
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
559 |
$\backslash d$ & $\mapsto$ & $0 + 1 + 2 + \ldots + 9$\\
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
560 |
$[\text{\it a - z}]$ & $\mapsto$ & $a + b + \ldots + z$\\
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
561 |
\end{tabular}
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
562 |
\end{center}
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
563 |
|
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
564 |
|
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
565 |
\subsection*{The Meaning of Regular Expressions}
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
566 |
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
567 |
So far we have only considered informally what the
|
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
568 |
\emph{meaning} of a regular expression is. This is not good
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
569 |
enough for specifications of what algorithms are supposed to
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
570 |
do or which problems they are supposed to solve.
|
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
571 |
|
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
572 |
To define the meaning of a regular expression we will
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
573 |
associate with every regular expression a language, or set of
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
574 |
strings. This language contains all the strings the regular
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
575 |
expression is supposed to match. To understand what is going
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
576 |
on here it is crucial that you have read the handout
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
577 |
about basic mathematical notations.
|
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
578 |
|
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
579 |
The \defn{meaning of a regular expression} can be defined
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
580 |
by a recursive function called $L$ (for language), which
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
581 |
is defined as follows
|
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
582 |
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
583 |
\begin{center}
|
399
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
584 |
\begin{tabular}{rcll}
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
585 |
$L(\ZERO)$ & $\dn$ & $\{\}$\\
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
586 |
$L(\ONE)$ & $\dn$ & $\{[]\}$\\
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
587 |
$L(c)$ & $\dn$ & $\{"c"\}$ & or equivalently $\dn \{[c]\}$\\
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
588 |
$L(r_1+ r_2)$ & $\dn$ & $L(r_1) \cup L(r_2)$\\
|
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
589 |
$L(r_1 \cdot r_2)$ & $\dn$ & $L(r_1) \,@\, L(r_2)$\\
|
399
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
590 |
$L(r^*)$ & $\dn$ & $(L(r))\star$\\
|
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
591 |
\end{tabular}
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
592 |
\end{center}
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
593 |
|
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
594 |
\noindent As a result we can now precisely state what the
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
595 |
meaning, for example, of the regular expression $h \cdot
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
596 |
e \cdot l \cdot l \cdot o$ is, namely
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
597 |
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
598 |
\[
|
248
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
599 |
L(h \cdot e \cdot l \cdot l \cdot o) = \{"hello"\}
|
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
600 |
\]
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
601 |
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
602 |
\noindent This is expected because this regular expression
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
603 |
is only supposed to match the ``$hello$''-string. Similarly if
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
604 |
we have the choice-regular-expression $a + b$, its meaning is
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
605 |
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
606 |
\[
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
607 |
L(a + b) = \{"a", "b"\}
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
608 |
\]
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
609 |
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
610 |
\noindent You can now also see why we do not make a difference
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
611 |
between the different regular expressions $(r_1 + r_2) + r_3$
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
612 |
and $r_1 + (r_2 + r_3)$\ldots they are not the same regular
|
248
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
613 |
expression, but they have the same meaning. For example
|
318
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
614 |
you can do the following calculation which shows they
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
615 |
have the same meaning:
|
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
616 |
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
617 |
\begin{eqnarray*}
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
618 |
L((r_1 + r_2) + r_3) & = & L(r_1 + r_2) \cup L(r_3)\\
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
619 |
& = & L(r_1) \cup L(r_2) \cup L(r_3)\\
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
620 |
& = & L(r_1) \cup L(r_2 + r_3)\\
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
621 |
& = & L(r_1 + (r_2 + r_3))
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
622 |
\end{eqnarray*}
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
623 |
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
624 |
The point of the definition of $L$ is that we can use it to
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
625 |
precisely specify when a string $s$ is matched by a regular
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
626 |
expression $r$, namely if and only if $s \in L(r)$. In fact we
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
627 |
will write a program \pcode{match} that takes any string $s$
|
492
|
628 |
and any regular expression $r$ as arguments and returns
|
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
629 |
\emph{yes}, if $s \in L(r)$ and \emph{no}, if $s \not\in
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
630 |
L(r)$. We leave this for the next lecture.
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
631 |
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
632 |
There is one more feature of regular expressions that is worth
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
633 |
mentioning. Given some strings, there are in general many
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
634 |
different regular expressions that can recognise these
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
635 |
strings. This is obvious with the regular expression $a + b$
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
636 |
which can match the strings $a$ and $b$. But also the regular
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
637 |
expression $b + a$ would match the same strings. However,
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
638 |
sometimes it is not so obvious whether two regular expressions
|
399
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
639 |
match the same strings: for example do $r^*$ and $\ONE + r
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
640 |
\cdot r^*$ match the same strings? What about $\ZERO^*$
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
641 |
and $\ONE^*$? This suggests the following relation between
|
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
642 |
\defn{equivalent regular expressions}:
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
643 |
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
644 |
\[
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
645 |
r_1 \equiv r_2 \;\dn\; L(r_1) = L(r_2)
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
646 |
\]
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
647 |
|
248
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
648 |
\noindent That means two regular expressions are said to be
|
550
|
649 |
equivalent if they match the same set of strings. That is
|
563
|
650 |
their meanings is the same. Therefore we
|
248
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
651 |
do not really distinguish between the different regular
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
652 |
expression $(r_1 + r_2) + r_3$ and $r_1 + (r_2 + r_3)$,
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
653 |
because they are equivalent. I leave you to the question
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
654 |
whether
|
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
655 |
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
656 |
\[
|
399
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
657 |
\ZERO^* \equiv \ONE^*
|
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
658 |
\]
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
659 |
|
399
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
660 |
\noindent holds or not? Such equivalences will be important
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
661 |
for our matching algorithm, because we can use them to
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
662 |
simplify regular expressions, which will mean we can speed
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
663 |
up the calculations.
|
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
664 |
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
665 |
\subsection*{My Fascination for Regular Expressions}
|
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
666 |
|
471
|
667 |
Up until a few years ago I was not really interested in regular
|
|
668 |
expressions. They have been studied for the last 60 years (by smarter
|
|
669 |
people than me)---surely nothing new can be found out about them. I
|
|
670 |
even have the vague recollection that I did not quite understand them
|
|
671 |
during my undergraduate study. If I remember correctly,\footnote{That
|
|
672 |
was really a long time ago.} I got utterly confused about $\ONE$
|
|
673 |
(which my lecturer wrote as $\epsilon$) and the empty string (which he
|
|
674 |
also wrote as $\epsilon$).\footnote{Obviously the lecturer must have
|
550
|
675 |
been bad ;o)} Since then, I have used regular expressions for
|
471
|
676 |
implementing lexers and parsers as I have always been interested in
|
|
677 |
all kinds of programming languages and compilers, which invariably
|
|
678 |
need regular expressions in some form or shape.
|
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
679 |
|
399
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
680 |
To understand my fascination \emph{nowadays} with regular
|
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
681 |
expressions, you need to know that my main scientific interest
|
471
|
682 |
for the last 17 years has been with theorem provers. I am a
|
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
683 |
core developer of one of
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
684 |
them.\footnote{\url{http://isabelle.in.tum.de}} Theorem
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
685 |
provers are systems in which you can formally reason about
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
686 |
mathematical concepts, but also about programs. In this way
|
550
|
687 |
theorem provers can help with the menacing problem of writing bug-free code. Theorem provers have
|
416
|
688 |
proved already their value in a number of cases (even in
|
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
689 |
terms of hard cash), but they are still clunky and difficult
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
690 |
to use by average programmers.
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
691 |
|
471
|
692 |
Anyway, in about 2011 I came across the notion of \defn{derivatives of
|
|
693 |
regular expressions}. This notion allows one to do almost all
|
|
694 |
calculations with regular expressions on the level of regular
|
|
695 |
expressions, not needing any automata (you will see we only touch
|
|
696 |
briefly on automata in lecture 3). Automata are usually the main
|
|
697 |
object of study in formal language courses. The avoidance of automata
|
550
|
698 |
is crucial for me because automata are graphs and it is rather difficult to
|
471
|
699 |
reason about graphs in theorem provers. In contrast, reasoning about
|
|
700 |
regular expressions is easy-peasy in theorem provers. Is this
|
|
701 |
important? I think yes, because according to Kuklewicz nearly all
|
|
702 |
POSIX-based regular expression matchers are
|
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
703 |
buggy.\footnote{\url{http://www.haskell.org/haskellwiki/Regex_Posix}}
|
471
|
704 |
With my PhD student Fahad Ausaf I proved the correctness for one such
|
|
705 |
matcher that was proposed by Sulzmann and Lu in
|
|
706 |
2014.\footnote{\url{http://goo.gl/bz0eHp}} Hopefully we can prove that
|
|
707 |
the machine code(!) that implements this code efficiently is correct
|
|
708 |
also. Writing programs in this way does not leave any room for
|
|
709 |
potential errors or bugs. How nice!
|
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
710 |
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
711 |
What also helped with my fascination with regular expressions
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
712 |
is that we could indeed find out new things about them that
|
416
|
713 |
have surprised some experts. Together with two colleagues from China, I was
|
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
714 |
able to prove the Myhill-Nerode theorem by only using regular
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
715 |
expressions and the notion of derivatives. Earlier versions of
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
716 |
this theorem used always automata in the proof. Using this
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
717 |
theorem we can show that regular languages are closed under
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
718 |
complementation, something which Gasarch in his
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
719 |
blog\footnote{\url{http://goo.gl/2R11Fw}} assumed can only be
|
550
|
720 |
shown via automata. So even somebody who has written a 700+-page
|
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
721 |
book\footnote{\url{http://goo.gl/fD0eHx}} on regular
|
550
|
722 |
expressions did not know better. Well, we showed it can also be
|
507
|
723 |
done with regular expressions only.\footnote{\url{http://nms.kcl.ac.uk/christian.urban/Publications/posix.pdf}}
|
471
|
724 |
What a feeling when you are an outsider to the subject!
|
243
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
725 |
|
471
|
726 |
To conclude: Despite my early ignorance about regular expressions, I
|
550
|
727 |
find them now extremely interesting. They have practical importance
|
471
|
728 |
(remember the shocking runtime of the regular expression matchers in
|
|
729 |
Python, Ruby and Java in some instances and the problems in Stack
|
550
|
730 |
Exchange and the Atom editor). They are used in tools like Snort and
|
|
731 |
Bro in order to monitor network traffic. They have a beautiful mathematical
|
|
732 |
theory behind them, which can be sometimes quite deep and which
|
|
733 |
sometimes contains hidden snares. People who are not very familiar
|
|
734 |
with the mathematical background of regular expressions get them
|
492
|
735 |
consistently wrong (this is surprising given they are a supposed to be
|
|
736 |
core skill for computer scientists). The hope is that we can do better
|
|
737 |
in the future---for example by proving that the algorithms actually
|
471
|
738 |
satisfy their specification and that the corresponding implementations
|
|
739 |
do not contain any bugs. We are close, but not yet quite there.
|
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
740 |
|
332
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
741 |
Notwithstanding my fascination, I am also happy to admit that regular
|
244
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
742 |
expressions have their shortcomings. There are some well-known
|
471
|
743 |
``theoretical'' shortcomings, for example recognising strings of the
|
|
744 |
form $a^{n}b^{n}$ is not possible with regular expressions. This means
|
550
|
745 |
for example if we try to recognise whether parentheses are well-nested
|
492
|
746 |
in an expression is impossible with (basic) regular expressions. I am
|
|
747 |
not so bothered by these shortcomings. What I am bothered about is
|
|
748 |
when regular expressions are in the way of practical programming. For
|
|
749 |
example, it turns out that the regular expression for email addresses
|
|
750 |
shown in \eqref{email} is hopelessly inadequate for recognising all of
|
|
751 |
them (despite being touted as something every computer scientist
|
|
752 |
should know about). The W3 Consortium (which standardises the Web)
|
|
753 |
proposes to use the following, already more complicated regular
|
|
754 |
expressions for email addresses:
|
244
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
755 |
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
756 |
{\small\begin{lstlisting}[language={},keywordstyle=\color{black},numbers=none]
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
757 |
[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
758 |
\end{lstlisting}}
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
759 |
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
760 |
\noindent But they admit that by using this regular expression
|
471
|
761 |
they wilfully violate the RFC 5322 standard, which specifies
|
244
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
762 |
the syntax of email addresses. With their proposed regular
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
763 |
expression they are too strict in some cases and too lax in
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
764 |
others. Not a good situation to be in. A regular expression
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
765 |
that is claimed to be closer to the standard is shown in
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
766 |
Figure~\ref{monster}. Whether this claim is true or not, I
|
416
|
767 |
would not know---the only thing I can say about this regular
|
248
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
768 |
expression is it is a monstrosity. However, this might
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
769 |
actually be an argument against the RFC standard, rather than
|
399
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
770 |
against regular expressions. A similar argument is made in
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
771 |
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
772 |
\begin{center}
|
570
|
773 |
\url{http://elliot.land/post/its-impossible-to-validate-an-email-address}
|
399
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
774 |
\end{center}
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
775 |
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
776 |
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
777 |
\noindent which explains some of the crazier parts of email
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
778 |
addresses. Still it is good to know that some tasks in text
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
779 |
processing just cannot be achieved by using regular
|
471
|
780 |
expressions. But for what we want to use them (lexing) they are
|
473
|
781 |
pretty good.\medskip
|
|
782 |
|
|
783 |
\noindent
|
|
784 |
Finally there is a joke about regular expressions:
|
|
785 |
|
|
786 |
\begin{quote}\it
|
|
787 |
``Sometimes you have a programming problem and it seems like the
|
|
788 |
best solution is to use regular expressions; now you have two
|
|
789 |
problems.''
|
|
790 |
\end{quote}
|
244
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
791 |
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
792 |
|
477
|
793 |
\begin{figure}[p]\small
|
550
|
794 |
\lstinputlisting[numbers=left,linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}]
|
477
|
795 |
{../progs/crawler1.scala}
|
399
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
796 |
|
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
797 |
\caption{The Scala code for a simple web-crawler that checks
|
722
|
798 |
for broken links in web-pages. It uses the regular expression
|
399
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
799 |
\texttt{http\_pattern} in Line~\ref{httpline} for recognising
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
800 |
URL-addresses. It finds all links using the library function
|
722
|
801 |
\texttt{findAllIn} in Line~\ref{findallline} (this function
|
|
802 |
is part of Scala's regular expression library).\label{crawler1}}
|
399
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
803 |
|
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
804 |
\end{figure}
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
805 |
|
722
|
806 |
|
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
807 |
|
722
|
808 |
%\begin{figure}[p]\small
|
|
809 |
% \lstinputlisting[numbers=left,linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}]
|
|
810 |
% {../progs/crawler2.scala}
|
|
811 |
%
|
|
812 |
%\caption{A version of the web-crawler that only follows links
|
|
813 |
%in ``my'' domain---since these are the ones I am interested in
|
|
814 |
%to fix. It uses the regular expression \texttt{my\_urls} in
|
|
815 |
%Line~\ref{myurlline} to check for my name in the links. The
|
|
816 |
%main change is in
|
|
817 |
%Lines~\ref{changestartline}--\ref{changeendline} where there
|
|
818 |
%is a test whether URL is in ``my'' domain or
|
|
819 |
%not.\label{crawler2}}
|
|
820 |
%\end{figure}
|
477
|
821 |
|
|
822 |
\begin{figure}[p]\small
|
550
|
823 |
\lstinputlisting[numbers=left,linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}]
|
477
|
824 |
{../progs/crawler2.scala}
|
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
825 |
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
826 |
\caption{A small email harvester---whenever we download a
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
827 |
web-page, we also check whether it contains any email
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
828 |
addresses. For this we use the regular expression
|
399
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
829 |
\texttt{email\_pattern} in Line~\ref{emailline}. The main
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
830 |
change is in Line~\ref{mainline} where all email addresses
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
831 |
that can be found in a page are printed.\label{crawler3}}
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
832 |
|
242
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
833 |
\end{figure}
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
834 |
|
244
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
835 |
\begin{figure}[p]
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
836 |
\tiny
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
837 |
\begin{center}
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
838 |
\begin{minipage}{0.8\textwidth}
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
839 |
\lstinputlisting[language={},keywordstyle=\color{black},numbers=none]{../progs/email-rexp}
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
840 |
\end{minipage}
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
841 |
\end{center}
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
842 |
|
404
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
843 |
\caption{Nothing that can be said about this regular
|
416
|
844 |
expression\ldots{}except it is a monstrosity.\label{monster}}
|
244
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
845 |
\end{figure}
|
108
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
846 |
|
112
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
847 |
|
105
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
848 |
\end{document}
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
849 |
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
850 |
%%% Local Variables:
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
851 |
%%% mode: latex
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
852 |
%%% TeX-master: t
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
853 |
%%% End:
|