% !TEX program = xelatex
\documentclass{article}
\usepackage{../style}
\usepackage{../langs}
\usepackage{../graphics}
\usepackage{../data}
\usepackage{lstlinebgrd}
\definecolor{capri}{rgb}{0.0, 0.75, 1.0}
%%http://regexcrossword.com/challenges/cities/puzzles/1
%%https://jex.im/regulex/
%%https://www.reddit.com/r/ProgrammingLanguages/comments/42dlem/mona_compiler_development_part_2_parsing/
%%https://www.reddit.com/r/ProgrammingLanguages/comments/43wlkq/formal_grammar_for_csh_tsch_sh_or_bash/
%% regex displayers
%% https://regexper.com/#a%7Ca
%% https://www.debuggex.com
%% https://jex.im/regulex/#!embed=false&flags=&re=%5E(a%7Cb)*%3F%24
%% email validator
%% http://www.ex-parrot.com/%7Epdw/Mail-RFC822-Address.html
% https://jackfoxy.github.io/FsRegEx/emailregex.html
%% regex testers
% https://regex101.com
% http://regexr.com
%% emacs regexes
%% https://www.gnu.org/software/emacs/manual/html_node/elisp/Regular-Expressions.html
%% reasons for a new programming language
%% http://beautifulracket.com
% compiler explorer
% https://gcc.godbolt.org
% good article how languages have been influenced
% 10 MOST(LY DEAD) INFLUENTIAL PROGRAMMING LANGUAGES
% https://www.hillelwayne.com/post/influential-dead-languages/
\begin{document}
\fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017, 2018, 2019, 2020}
\section*{Handout 1}
The purpose of a compiler is to transform a program a human can read and
write into code the machine can run as fast as possible. Developing a
compiler is an old craft going back to 1952 with the first compiler
written by Grace Hopper.\footnote{Who many years ago was invited on a
talk show hosted by David Letterman.
\here{https://youtu.be/3N_ywhx6_K0?t=31}} Why study compilers
nowadays? An interesting answer is given by John Regehr in his compiler
blog:\here{http://blog.regehr.org/archives/1419}
\begin{quote}\it{}``We can start off with a couple of observations
about the role of compilers. First, hardware is getting weirder
rather than getting clocked faster: almost all processors are
multicores and it looks like there is increasing asymmetry in
resources across cores. Processors come with vector units, crypto
accelerators, bit twiddling instructions, and lots of features to
make virtualization and concurrency work. We have DSPs, GPUs,
big.little, and Xeon Phi. This is only scratching the
surface. Second, we’re getting tired of low-level languages and
their associated security disasters, we want to write new code, to
whatever extent possible, in safer, higher-level
languages. Compilers are caught right in the middle of these
opposing trends: one of their main jobs is to help bridge the large
and growing gap between increasingly high-level languages and
increasingly wacky platforms. It’s effectively a perpetual
employment act for solid compiler hackers.''
\end{quote}
\noindent
So the goal of this module is to become a solid (beginner) compiler
hacker and, as part of the coursework, to implement a small
compiler for a very small programming language.
The first part of the module is about the problem of text processing,
which is needed for compilers, but also for dictionaries, DNA-data,
spam-filters and so on. When looking for a particular string, say
\pcode{"foobar"}, in a large text we can use the Knuth-Morris-Pratt
algorithm, which is one of the most efficient general string search
algorithms. But often we do \emph{not} just look for a particular string,
but for string patterns. For example, in program code we need to
identify what are the keywords (\texttt{if}, \texttt{then},
\texttt{while}, \texttt{for}, etc) and what are the identifiers
(variable names). A pattern for identifiers could be stated as: they
start with a letter, followed by zero or more letters, numbers and
underscores.
%You might also be surprised what
%constraints programming languages impose about numbers: for example
%123 in JSON is OK, but 0123 is not.
%
% The regex for JSON numbers is
%
% -?(0|[1-9][0-9]*)(\.[0-9]+)?([eE][+-]?[0-9]+)?
Often we also face the problem that we are given a string, for example
some user input, and we want to know whether it matches a particular
pattern---is it an email address, for example. In this way we can
exclude user input that would otherwise have nasty effects on our
program (crashing it or making it go into an infinite loop, if not
worse). This kind of ``inspecting'' mechanism is also implemented in
popular network security tools such as Snort and
Bro.\here{www.snort.org}\here{www.bro.org} They scan incoming
network traffic for computer viruses or malicious packets. Similarly
filtering out spam usually involves looking for some signature
(essentially a string pattern). The point is that the fast
Knuth-Morris-Pratt algorithm for strings is not good enough for such
string \emph{patterns}.\smallskip
\defn{Regular expressions} help with conveniently specifying
such patterns. The idea behind regular expressions is that
they are a simple method for describing languages (or sets of
strings)\ldots at least languages we are interested in in
computer science. For example there is no convenient regular
expression for describing the English language short of
enumerating all English words. But they seem useful for
describing, for example, simple email addresses.\footnote{See
``8 Regular Expressions You Should Know''
\url{http://goo.gl/5LoVX7}} Consider the following regular
expression
\begin{equation}\label{email}
\texttt{[a-z0-9\_.-]+} \;\;\texttt{@}\;\; \texttt{[a-z0-9.-]+} \;\;\texttt{.}\;\; \texttt{[a-z.]\{2,6\}}
\end{equation}
\noindent where the first part, the user name, matches one or more
lowercase letters (\pcode{a-z}), digits (\pcode{0-9}), underscores, dots
and hyphens. The \pcode{+} at the end of the brackets ensures the ``one
or more''. Then comes the email \pcode{@}-sign, followed by the domain
name, which must consist of one or more lowercase letters, digits, dots
or hyphens (but, this time, no underscores). Finally there must be a dot
followed by the toplevel domain. This toplevel domain must be 2 to 6
lowercase letters including the dot. Example strings which follow this
pattern are:
\begin{lstlisting}[language={},numbers=none,keywordstyle=\color{black}]
niceandsimple@example.org
very.common@example.co.uk
a.little.lengthy.but.fine@dept.example.ac.uk
other.email-with-dash@example.edu
\end{lstlisting}
\noindent
But for example the following two do not
\begin{lstlisting}[language={},numbers=none,keywordstyle=\color{black}]
user@localserver
disposable.style.email.with+symbol@example.com
\end{lstlisting}
\noindent according to the regular expression we specified in line
\eqref{email} above. Whether this is intended or not is a different
question (the second email above is actually an acceptable email
address according to the RFC 5322 standard for email addresses).
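To see the pattern in action, here is a minimal sketch in Scala
(assuming Scala 2.13, whose \texttt{Regex} class provides a
\texttt{matches} method; note that in actual regex syntax the literal
dot before the toplevel domain needs to be escaped):

\begin{lstlisting}[numbers=none]
val email = """[a-z0-9_.-]+@[a-z0-9.-]+\.[a-z.]{2,6}""".r

println(email.matches("niceandsimple@example.org"))  // true
println(email.matches("user@localserver"))           // false
\end{lstlisting}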
As mentioned above, identifiers, or variables, in program code
are often required to satisfy the constraints that they start
with a letter and then can be followed by zero or more letters
or numbers and also can include underscores, but not as the
first character. Such identifiers can be recognised with the
regular expression
\begin{center}
\pcode{[a-zA-Z] [a-zA-Z0-9_]*}
\end{center}
\noindent Possible identifiers that match this regular expression
are \pcode{x}, \pcode{foo}, \pcode{foo_bar_1}, \pcode{A_very_42_long_object_name},
but not \pcode{_i} and also not \pcode{4you}.
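Again, checking this in Scala is a one-liner (a sketch, Scala 2.13
assumed):

\begin{lstlisting}[numbers=none]
val id = """[a-zA-Z][a-zA-Z0-9_]*""".r

println(id.matches("foo_bar_1"))  // true
println(id.matches("_i"))         // false
println(id.matches("4you"))       // false
\end{lstlisting}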
Many programming languages offer libraries that can be used to
validate such strings against regular expressions. Also there
are some common, and I am sure very familiar, ways of constructing
regular expressions. For example in Scala we have a library
implementing the following regular expression constructs (a few
examples follow the table):
\begin{center}
\begin{tabular}{lp{9cm}}
\pcode{re*} & matches 0 or more occurrences of preceding
expression\\
\pcode{re+} & matches 1 or more occurrences of preceding
expression\\
\pcode{re?} & matches 0 or 1 occurrence of preceding
expression\\
\pcode{re\{n\}} & matches exactly \pcode{n} number of
occurrences of preceding expression\\
\pcode{re\{n,m\}} & matches at least \pcode{n} and at most {\tt m}
occurrences of the preceding expression\\
\pcode{[...]} & matches any single character inside the
brackets\\
\pcode{[^...]} & matches any single character not inside the
brackets\\
\pcode{...-...} & character ranges\\
\pcode{\\d} & matches digits; equivalent to \pcode{[0-9]}\\
\pcode{.} & matches every character except newline\\
\pcode{(re)} & groups regular expressions and remembers
matched text
\end{tabular}
\end{center}
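\noindent Here are a few of these constructs in action (a sketch;
Scala 2.13 assumed):

\begin{lstlisting}[numbers=none]
println("""\d{2,4}""".r.matches("2024"))   // true:  2 to 4 digits
println("""\d{2,4}""".r.matches("7"))      // false: too few digits
println("""(ab)+""".r.matches("ababab"))   // true:  1+ copies of the group
println("""[^0-9]*""".r.matches("abc"))    // true:  no digits at all
\end{lstlisting}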
\noindent With this table you can figure out the purpose of the regular
expressions in the web-crawlers shown in Figures \ref{crawler1} and
\ref{crawler3}. In Figure~\ref{crawler1}, however, be careful with
the regular expression for http-addresses in Line~\ref{httpline}. It is
intended to be
\[
\pcode{"https?://[^"]*"}
\]
\noindent specifying that http-addresses need to start with a double
quote, then comes \texttt{http} followed by an optional \texttt{s} and
so on\ldots{}until the closing double quote comes at the end of the
address. Normally we would have to escape the double quotes in order to
make sure we interpret the double quote as a character, not as the
delimiter of a string. But Scala's trick with triple quotes allows us to
omit this kind of ugly escaping. As a result we can just write:
\[
\pcode{""""https?://[^"]*"""".r}
\]
\noindent The convention in Scala is that \texttt{.r} converts a string
into a regular expression. I leave it to you to ponder whether this
regular expression really captures all possible web-addresses. If you
need a quick recap about regular expressions and how they match strings,
here is a quick video: \url{https://youtu.be/bgBWp9EIlMM}.
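To put the pieces together, here is a small sketch of how such a
pattern could be used for extracting addresses from a web-page (the
URL below is just a placeholder; compare with the crawler in
Figure~\ref{crawler1}):

\begin{lstlisting}[numbers=none]
import scala.io.Source

val http_pattern = """"https?://[^"]*"""".r

// fetch a page and print every quoted http-address occurring in it
val page = Source.fromURL("http://www.example.com", "ISO-8859-1").mkString
for (addr <- http_pattern.findAllIn(page)) println(addr)
\end{lstlisting}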
\subsection*{Why Study Regular Expressions?}
Regular expressions were introduced by Kleene in the 1950s and they
have been the object of intense study since then. They are nowadays pretty
much ubiquitous in computer science. There are many libraries
implementing regular expressions. I am sure you have come across them
before (remember the PRA or PEP modules?).
Why on earth then is there any interest in studying them again in depth
in this module? Well, one answer is in the following two graphs about
regular expression matching in Python, Ruby, JavaScript and Java
(Version 8).
\begin{center}
\begin{tabular}{@{\hspace{-1mm}}c@{\hspace{1.5mm}}c@{}}
\begin{tikzpicture}
\begin{axis}[
title={Graph: $\texttt{(a*)*\,b}$ and strings
$\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
xlabel={$n$},
x label style={at={(1.05,0.0)}},
ylabel={time in secs},
enlargelimits=false,
xtick={0,5,...,30},
xmax=33,
ymax=35,
ytick={0,5,...,30},
scaled ticks=false,
axis lines=left,
width=5.5cm,
height=4.5cm,
legend entries={Python, Java 8, JavaScript},
legend pos=north west,
legend cell align=left]
\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
\addplot[red,mark=*, mark options={fill=white}] table {re-js.data};
\end{axis}
\end{tikzpicture}
&
\begin{tikzpicture}
\begin{axis}[
title={Graph: $\texttt{a?\{n\}\,a{\{n\}}}$ and strings
$\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
xlabel={$n$},
x label style={at={(1.05,0.0)}},
ylabel={time in secs},
enlargelimits=false,
xtick={0,5,...,30},
xmax=33,
ymax=35,
ytick={0,5,...,30},
scaled ticks=false,
axis lines=left,
width=5.5cm,
height=4.5cm,
legend entries={Python,Ruby},
legend pos=north west,
legend cell align=left]
\addplot[blue,mark=*, mark options={fill=white}] table {re-python.data};
\addplot[brown,mark=triangle*, mark options={fill=white}] table {re-ruby.data};
\end{axis}
\end{tikzpicture}
\end{tabular}
\end{center}
\noindent This first graph shows that Python, JavaScript and Java 8 need
approximately 30 seconds to find out that the regular expression
$\texttt{(a*)*\,b}$ does not match strings of 28 \texttt{a}s. Similarly,
the second shows that Python and Ruby need approximately 29 seconds to find
out whether a string of 28 \texttt{a}s matches the regular expression
\texttt{a?\{28\}\,a\{28\}}.\footnote{In this
example Ruby actually uses the slightly different regular expression
\texttt{a?a?a?...a?a?aaa...aa}, where the \texttt{a?} and \texttt{a}
each occur $n$ times. More such test cases can be found at
\url{https://www.owasp.org/index.php/Regular_expression_Denial_of_Service_-_ReDoS}.}
Admittedly, these regular expressions are carefully chosen to exhibit
this exponential behaviour, but similar ones occur more often than one
wants in ``real life''. For example, on 20 July 2016 a similar regular
expression brought the webpage \href{http://stackexchange.com}{Stack
Exchange} to its knees:
\begin{center}
\url{http://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}
\end{center}
\noindent I can also highly recommend a cool webtalk by an engineer
at Stack Exchange on the same subject:
\begin{center}
\url{https://vimeo.com/112065252}
\end{center}
\noindent
A similar problem also occurred in the Atom editor:
\begin{center}
\url{http://davidvgalbraith.com/how-i-fixed-atom/}
\end{center}
\noindent
and also when somebody tried to match web-addresses using a
relatively simple regular expression
\begin{center}
\url{https://www.tutorialdocs.com/article/regex-trap.html}
\end{center}
\noindent
Finally, on 2 July 2019 Cloudflare had a global outage because of a
regular expression (they had not had an outage in the previous six
years). What happened is nicely explained in their blog post:
\begin{center}
\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019}
\end{center}
Such troublesome regular expressions are sometimes called \emph{evil
regular expressions} because they have the potential to make regular
expression matching engines topple over, like in Python, Ruby,
JavaScript and Java 8. This ``toppling over'' is also sometimes called
\emph{catastrophic backtracking}. I have also seen the term
\emph{eternal matching} used for this. The problem with evil regular
expressions and catastrophic backtracking is that they can have some
serious consequences, for example, if you use them in your
web-application. The reason is that hackers can look for these instances
where the matching engine behaves badly and then mount a nice DoS-attack
against your application. These attacks already have their own name:
\emph{Regular Expression Denial of Service Attacks (ReDoS)}.
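You can observe catastrophic backtracking yourself with a small
experiment (a rough sketch, assuming Scala 2.13 on the JVM, whose
default regex engine is backtracking-based like Java's):

\begin{lstlisting}[numbers=none]
val evil = """(a*)*b""".r

// time how long the matcher takes on strings of n a's
for (n <- 20 to 28) {
  val start = System.nanoTime()
  evil.matches("a" * n)
  val secs = (System.nanoTime() - start) / 1e9
  println(n + ": " + secs + " secs")
}
\end{lstlisting}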
It will be instructive to look behind the ``scenes'' to find out why
Python and Ruby (and others) behave so badly when matching strings with
evil regular expressions. But we will also look at a relatively simple
algorithm that solves this problem much better than Python and Ruby
do\ldots{}actually it will be two versions of the algorithm: in the
example \texttt{a?\{n\}\,a\{n\}}, the first version will be able to
process strings of approximately 1,100 \texttt{a}s in 23 seconds, while
the second version will even be able to process up to 11,000(!)
\texttt{a}s in 5 seconds, see the graph below:
\begin{center}
\begin{tikzpicture}
\begin{axis}[
title={Graph: $\texttt{a?\{n\}\,a{\{n\}}}$ and strings
$\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
xlabel={$n$},
x label style={at={(1.05,0.0)}},
ylabel={time in secs},
enlargelimits=false,
xtick={0,3000,...,12000},
xmax=13000,
ymax=32,
ytick={0,5,...,30},
scaled ticks=false,
axis lines=left,
width=7cm,
height=4.4cm,
legend entries={Our Algorithm V1, Our Algorithm V2},
legend pos=outer north east]
\addplot[green,mark=square*,mark options={fill=white}] table {re2.data};
\addplot[black,mark=square*,mark options={fill=white}] table {re3.data};
\end{axis}
\end{tikzpicture}
\end{center}
\noindent And in the case of the regular expression $\texttt{(a*)*\,b}$
and strings of \texttt{a}s we will beat Java 8 by a factor of
approximately 1,000,000 (note the scale on the $x$-axis).
\begin{center}
\begin{tikzpicture}
\begin{axis}[
title={Graph: $\texttt{(a*)*\,b}$ and strings
$\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
xlabel={$n$},
x label style={at={(1.05,0.0)}},
ylabel={time in secs},
enlargelimits=false,
ymax=32,
ytick={0,5,...,30},
axis lines=left,
width=7cm,
height=4.5cm,
legend entries={Our Algorithm V2},
legend pos=outer north east]
\addplot[black,mark=square*,mark options={fill=white}] table {re3a.data};
\end{axis}
\end{tikzpicture}
\end{center}
\noindent
You might have wondered above why I looked at the (now) old Java 8: the
reason is that Java 9 and later versions are a bit better, but we will
still beat them hands down with our regex matcher.
\subsection*{Basic Regular Expressions}
The regular expressions shown earlier for Scala we
will call \emph{extended regular expressions}. The ones we
will mainly study in this module are \emph{basic regular
expressions}, which by convention we will just call
\emph{regular expressions}, if it is clear what we mean. The
attraction of (basic) regular expressions is that many
features of the extended ones are just syntactic sugar.
(Basic) regular expressions are defined by the following
grammar:
\begin{center}
\begin{tabular}{r@{\hspace{1mm}}r@{\hspace{1mm}}l@{\hspace{13mm}}l}
$r$ & $::=$ & $\ZERO$ & null language\\
& $\mid$ & $\ONE$ & empty string / \texttt{""} / []\\
& $\mid$ & $c$ & single character\\
& $\mid$ & $r_1 + r_2$ & alternative / choice\\
& $\mid$ & $r_1 \cdot r_2$ & sequence\\
& $\mid$ & $r^*$ & star (zero or more)\\
\end{tabular}
\end{center}
\noindent Because we overload our notation, there are some
subtleties you should be aware of. When regular expressions
are referred to, then $\ZERO$ (in bold font) does not stand for
the number zero: rather it is a particular pattern that does
not match any string. Similarly, in the context of regular
expressions, $\ONE$ does not stand for the number one but for
a regular expression that matches the empty string. The letter
$c$ stands for any character from the alphabet at hand. Again
in the context of regular expressions, it is a particular
pattern that can match the specified character. You should
also be careful with our overloading of the star: assuming you
have read the handout about our basic mathematical notation,
you will see that in the context of languages (sets of
strings) the star stands for an operation on languages. Here
$r^*$ stands for a regular expression, which is different from
the star operation on sets of strings, defined as
\[
A\star\dn \bigcup_{0\le n} A^n
\]
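\noindent For example, if $A = \{"a"\}$, then $A\star$ is the infinite
set $\{[], "a", "aa", "aaa", \ldots\}$ consisting of strings of zero or
more $a$s.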
We will use parentheses to disambiguate regular expressions.
Parentheses are not really part of a regular expression, and
indeed we do not need them in our code because there the tree
structure of regular expressions is always clear. But for
writing them down in a more mathematical fashion, parentheses
will be helpful. For example we will write $(r_1 + r_2)^*$,
which is different from, say $r_1 + (r_2)^*$. This can be
seen if we write regular expressions as trees:
\begin{center}
\includegraphics[scale=0.65]{../pics/r1.pdf}
\hspace{3cm}
\includegraphics[scale=0.65]{../pics/r2.pdf}
\end{center}
\noindent
The regular expression on the left means
roughly zero or more times $r_1$ or $r_2$, while the one on the right
means $r_1$, or zero or more times $r_2$. This will turn out to
be two different patterns, which match in general different
strings. Strictly speaking, we should also write $(r_1 + r_2) + r_3$, which is
different from the regular expression $r_1 + (r_2 + r_3)$, but
in the case of $+$ and $\cdot$ we actually do not care about the
order and just write $r_1 + r_2 + r_3$, or $r_1 \cdot r_2
\cdot r_3$, respectively. The reasons for this carelessness will become
clear shortly.
In the literature you will often find that the choice $r_1 +
r_2$ is written as $r_1\mid{}r_2$ or $r_1\mid\mid{}r_2$. Also,
often our $\ZERO$ and $\ONE$ are written $\varnothing$ and
$\epsilon$, respectively. Following the convention in the
literature, we will often omit the $\cdot$. This
is to make some concrete regular expressions more readable.
For example the regular expression for email addresses shown
in \eqref{email} would, fully expanded, look like
\[
\texttt{[...]+} \;\cdot\; \texttt{@} \;\cdot\;
\texttt{[...]+} \;\cdot\; \texttt{.} \;\cdot\;
\texttt{[...]\{2,6\}}
\]
\noindent
which is much less readable than the regular expression in
\eqref{email}. Similarly for the regular expression that matches the
string $hello$ we should write
\[
h \cdot e \cdot l \cdot l \cdot o
\]
\noindent
but often just write {\it hello}.
If you prefer to think in terms of the implementation
of regular expressions in Scala, the constructors and
classes relate as follows\footnote{More about Scala is
in the handout about \emph{A Crash-Course on Scala}.}
\begin{center}
\begin{tabular}{rcl}
$\ZERO$ & $\mapsto$ & \texttt{ZERO}\\
$\ONE$ & $\mapsto$ & \texttt{ONE}\\
$c$ & $\mapsto$ & \texttt{CHAR(c)}\\
$r_1 + r_2$ & $\mapsto$ & \texttt{ALT(r1, r2)}\\
$r_1 \cdot r_2$ & $\mapsto$ & \texttt{SEQ(r1, r2)}\\
$r^*$ & $\mapsto$ & \texttt{STAR(r)}
\end{tabular}
\end{center}
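In code, this correspondence can be realised as an algebraic datatype
(a minimal sketch of how the Scala constructors above could be
defined):

\begin{lstlisting}[numbers=none]
abstract class Rexp
case object ZERO extends Rexp                    // null language
case object ONE extends Rexp                     // empty string
case class CHAR(c: Char) extends Rexp            // single character
case class ALT(r1: Rexp, r2: Rexp) extends Rexp  // alternative
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp  // sequence
case class STAR(r: Rexp) extends Rexp            // star
\end{lstlisting}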
A source of confusion might arise from the fact that we
use the term \emph{basic regular expression} for the regular
expressions used in ``theory'' and defined above, and
\emph{extended regular expression} for the ones used in
``practice'', for example in Scala. If runtime is not an
issue, then the latter can be seen as syntactic sugar for
the former. For example we could replace
\begin{center}
\begin{tabular}{rcl}
$r^+$ & $\mapsto$ & $r\cdot r^*$\\
$r^?$ & $\mapsto$ & $\ONE + r$\\
$\backslash d$ & $\mapsto$ & $0 + 1 + 2 + \ldots + 9$\\
$[\text{\it a - z}]$ & $\mapsto$ & $a + b + \ldots + z$\\
\end{tabular}
\end{center}
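\noindent In terms of the Scala datatype above, such replacements
might look as follows (a sketch; the helper names are my own):

\begin{lstlisting}[numbers=none]
def PLUS(r: Rexp) : Rexp = SEQ(r, STAR(r))   // r+
def OPT(r: Rexp) : Rexp = ALT(ONE, r)        // r?

// alternatives over a nonempty range of characters, e.g. [a-z]
def RANGE(cs: Seq[Char]) : Rexp =
  cs.map(c => CHAR(c) : Rexp).reduceRight(ALT(_, _))

val DIGIT = RANGE('0' to '9')                // \d
\end{lstlisting}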
\subsection*{The Meaning of Regular Expressions}
So far we have only considered informally what the
\emph{meaning} of a regular expression is. This is not good
enough for specifications of what algorithms are supposed to
do or which problems they are supposed to solve.
To define the meaning of a regular expression we will
associate with every regular expression a language, or set of
strings. This language contains all the strings the regular
expression is supposed to match. To understand what is going
on here it is crucial that you have read the handout
about basic mathematical notations.
The \defn{meaning of a regular expression} can be given
by a recursive function called $L$ (for language), which
is defined as follows
\begin{center}
\begin{tabular}{rcll}
$L(\ZERO)$ & $\dn$ & $\{\}$\\
$L(\ONE)$ & $\dn$ & $\{[]\}$\\
$L(c)$ & $\dn$ & $\{"c"\}$ & or equivalently $\dn \{[c]\}$\\
$L(r_1+ r_2)$ & $\dn$ & $L(r_1) \cup L(r_2)$\\
$L(r_1 \cdot r_2)$ & $\dn$ & $L(r_1) \,@\, L(r_2)$\\
$L(r^*)$ & $\dn$ & $(L(r))\star$\\
\end{tabular}
\end{center}
\noindent As a result we can now precisely state what the
meaning, for example, of the regular expression $h \cdot
e \cdot l \cdot l \cdot o$ is, namely
\[
L(h \cdot e \cdot l \cdot l \cdot o) = \{"hello"\}
\]
\noindent This is expected because this regular expression
is only supposed to match the ``$hello$''-string. Similarly if
we have the choice-regular-expression $a + b$, its meaning is
\[
L(a + b) = \{"a", "b"\}
\]
\noindent You can now also see why we do not distinguish
between the different regular expressions $(r_1 + r_2) + r_3$
and $r_1 + (r_2 + r_3)$\ldots they are not the same regular
expression, but they have the same meaning. For example
you can do the following calculation which shows they
have the same meaning:
\begin{eqnarray*}
L((r_1 + r_2) + r_3) & = & L(r_1 + r_2) \cup L(r_3)\\
& = & L(r_1) \cup L(r_2) \cup L(r_3)\\
& = & L(r_1) \cup L(r_2 + r_3)\\
& = & L(r_1 + (r_2 + r_3))
\end{eqnarray*}
The point of the definition of $L$ is that we can use it to
precisely specify when a string $s$ is matched by a regular
expression $r$, namely if and only if $s \in L(r)$. In fact we
will write a program \pcode{match} that takes any string $s$
and any regular expression $r$ as arguments and returns
\emph{yes}, if $s \in L(r)$ and \emph{no}, if $s \not\in
L(r)$. We leave this for the next lecture.
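To see the clauses of $L$ at work in code, here is a toy Scala sketch
over \emph{finite} languages (my own illustration; $L(r^*)$ is in
general an infinite set, so the star case cannot return a finite
\texttt{Set} and is left out):

\begin{lstlisting}[numbers=none]
type Lang = Set[String]

// language concatenation A @ B
def concat(A: Lang, B: Lang) : Lang =
  for (s1 <- A; s2 <- B) yield s1 ++ s2

def L(r: Rexp) : Lang = r match {
  case ZERO => Set()
  case ONE => Set("")
  case CHAR(c) => Set(c.toString)
  case ALT(r1, r2) => L(r1) | L(r2)
  case SEQ(r1, r2) => concat(L(r1), L(r2))
  case STAR(_) => sys.error("star generates an infinite language")
}
\end{lstlisting}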
There is one more feature of regular expressions that is worth
mentioning. Given some strings, there are in general many
different regular expressions that can recognise these
strings. This is obvious with the regular expression $a + b$
which can match the strings $a$ and $b$. But also the regular
expression $b + a$ would match the same strings. However,
sometimes it is not so obvious whether two regular expressions
match the same strings: for example do $r^*$ and $\ONE + r
\cdot r^*$ match the same strings? What about $\ZERO^*$
and $\ONE^*$? This suggests the following relation between
\defn{equivalent regular expressions}:
\[
r_1 \equiv r_2 \;\dn\; L(r_1) = L(r_2)
\]
\noindent That means two regular expressions are said to be
equivalent if they match the same set of strings; that is,
their meaning is the same. Therefore we
do not really distinguish between the different regular
expressions $(r_1 + r_2) + r_3$ and $r_1 + (r_2 + r_3)$,
because they are equivalent. I leave it to you to ponder
whether
\[
\ZERO^* \equiv \ONE^*
\]
\noindent holds or not. Such equivalences will be important
for our matching algorithm, because we can use them to
simplify regular expressions, which will mean we can speed
up the calculations.
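For example, the equivalences $\ZERO + r \equiv r$, $\ZERO \cdot r
\equiv \ZERO$ and $\ONE \cdot r \equiv r$ can be baked into a small
simplification function (a sketch in terms of the Rexp datatype from
above):

\begin{lstlisting}[numbers=none]
def simp(r: Rexp) : Rexp = r match {
  case ALT(r1, r2) => (simp(r1), simp(r2)) match {
    case (ZERO, s2) => s2                 // 0 + r == r
    case (s1, ZERO) => s1                 // r + 0 == r
    case (s1, s2) => if (s1 == s2) s1 else ALT(s1, s2)
  }
  case SEQ(r1, r2) => (simp(r1), simp(r2)) match {
    case (ZERO, _) => ZERO                // 0 . r == 0
    case (_, ZERO) => ZERO
    case (ONE, s2) => s2                  // 1 . r == r
    case (s1, ONE) => s1
    case (s1, s2) => SEQ(s1, s2)
  }
  case _ => r
}
\end{lstlisting}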
\subsection*{My Fascination for Regular Expressions}
Up until a few years ago I was not really interested in regular
expressions. They have been studied for the last 60 years (by smarter
people than me)---surely nothing new can be found out about them. I
even have the vague recollection that I did not quite understand them
during my undergraduate study. If I remember correctly,\footnote{That
was really a long time ago.} I got utterly confused about $\ONE$
(which my lecturer wrote as $\epsilon$) and the empty string (which he
also wrote as $\epsilon$).\footnote{Obviously the lecturer must have
been bad ;o)} Since then, I have used regular expressions for
implementing lexers and parsers as I have always been interested in
all kinds of programming languages and compilers, which invariably
need regular expressions in some form or shape.
To understand my fascination \emph{nowadays} with regular
expressions, you need to know that my main scientific interest
for the last 17 years has been with theorem provers. I am a
core developer of one of
them.\footnote{\url{http://isabelle.in.tum.de}} Theorem
provers are systems in which you can formally reason about
mathematical concepts, but also about programs. In this way
theorem provers can help with the menacing problem of writing bug-free
code. Theorem provers have already proved their value in a number of cases (even in
terms of hard cash), but they are still clunky and difficult
to use by average programmers.
Anyway, in about 2011 I came across the notion of \defn{derivatives of
regular expressions}. This notion allows one to do almost all
calculations with regular expressions on the level of regular
expressions, not needing any automata (you will see we only touch
briefly on automata in lecture 3). Automata are usually the main
object of study in formal language courses. The avoidance of automata
is crucial for me because automata are graphs and it is rather difficult to
reason about graphs in theorem provers. In contrast, reasoning about
regular expressions is easy-peasy in theorem provers. Is this
important? I think yes, because according to Kuklewicz nearly all
POSIX-based regular expression matchers are
buggy.\footnote{\url{http://www.haskell.org/haskellwiki/Regex_Posix}}
With my PhD student Fahad Ausaf I proved the correctness for one such
matcher that was proposed by Sulzmann and Lu in
2014.\footnote{\url{http://goo.gl/bz0eHp}} Hopefully we can prove that
the machine code(!) that implements this matcher efficiently is also
correct. Writing programs in this way does not leave any room for
potential errors or bugs. How nice!
What also helped with my fascination with regular expressions
is that we could indeed find out new things about them that
have surprised some experts. Together with two colleagues from China, I was
able to prove the Myhill-Nerode theorem by only using regular
expressions and the notion of derivatives. Earlier proofs of
this theorem always used automata. Using this
theorem we can show that regular languages are closed under
complementation, something which Gasarch in his
blog\footnote{\url{http://goo.gl/2R11Fw}} assumed could only be
shown via automata. So even somebody who has written a 700+-page
book\footnote{\url{http://goo.gl/fD0eHx}} on regular
expressions did not know better. Well, we showed it can also be
done with regular expressions only.\footnote{\url{http://nms.kcl.ac.uk/christian.urban/Publications/posix.pdf}}
What a feeling when you are an outsider to the subject!
To conclude: Despite my early ignorance about regular expressions, I
find them now extremely interesting. They have practical importance
(remember the shocking runtime of the regular expression matchers in
Python, Ruby and Java in some instances and the problems in Stack
Exchange and the Atom editor). They are used in tools like Snort and
Bro in order to monitor network traffic. They have a beautiful mathematical
theory behind them, which can be sometimes quite deep and which
sometimes contains hidden snares. People who are not very familiar
with the mathematical background of regular expressions get them
consistently wrong (this is surprising given they are supposed to be a
core skill for computer scientists). The hope is that we can do better
in the future---for example by proving that the algorithms actually
satisfy their specification and that the corresponding implementations
do not contain any bugs. We are close, but not yet quite there.
Notwithstanding my fascination, I am also happy to admit that regular
expressions have their shortcomings. There are some well-known
``theoretical'' shortcomings, for example recognising strings of the
form $a^{n}b^{n}$ is not possible with regular expressions. This means,
for example, that recognising whether parentheses are well-nested
in an expression is impossible with (basic) regular expressions. I am
not so bothered by these shortcomings. What I am bothered about is
when regular expressions are in the way of practical programming. For
example, it turns out that the regular expression for email addresses
shown in \eqref{email} is hopelessly inadequate for recognising all of
them (despite being touted as something every computer scientist
should know about). The W3 Consortium (which standardises the Web)
proposes to use the following, already more complicated, regular
expression for email addresses:
{\small\begin{lstlisting}[language={},keywordstyle=\color{black},numbers=none]
[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*
\end{lstlisting}}
\noindent But they admit that by using this regular expression
they wilfully violate the RFC 5322 standard, which specifies
the syntax of email addresses. With their proposed regular
expression they are too strict in some cases and too lax in
others. Not a good situation to be in. A regular expression
that is claimed to be closer to the standard is shown in
Figure~\ref{monster}. Whether this claim is true or not, I
would not know---the only thing I can say about this regular
expression is it is a monstrosity. However, this might
actually be an argument against the RFC standard, rather than
against regular expressions. A similar argument is made in
\begin{center}
\url{http://elliot.land/post/its-impossible-to-validate-an-email-address}
\end{center}
\noindent which explains some of the crazier parts of email
addresses. Still it is good to know that some tasks in text
processing just cannot be achieved by using regular
expressions. But for what we want to use them (lexing) they are
pretty good.\medskip
\noindent
Finally there is a joke about regular expressions:
\begin{quote}\it
``Sometimes you have a programming problem and it seems like the
best solution is to use regular expressions; now you have two
problems.''
\end{quote}
\begin{figure}[p]\small
\lstinputlisting[numbers=left,linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}]
{../progs/crawler1.scala}
\caption{The Scala code for a simple web-crawler that checks
for broken links in web-pages. It uses the regular expression
\texttt{http\_pattern} in Line~\ref{httpline} for recognising
URL-addresses. It finds all links using the library function
\texttt{findAllIn} in Line~\ref{findallline} (this function
is part of Scala's regular expression library).\label{crawler1}}
\end{figure}
%\begin{figure}[p]\small
% \lstinputlisting[numbers=left,linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}]
% {../progs/crawler2.scala}
%
%\caption{A version of the web-crawler that only follows links
%in ``my'' domain---since these are the ones I am interested in
%to fix. It uses the regular expression \texttt{my\_urls} in
%Line~\ref{myurlline} to check for my name in the links. The
%main change is in
%Lines~\ref{changestartline}--\ref{changeendline} where there
%is a test whether URL is in ``my'' domain or
%not.\label{crawler2}}
%\end{figure}
\begin{figure}[p]\small
\lstinputlisting[numbers=left,linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}]
{../progs/crawler2.scala}
\caption{A small email harvester---whenever we download a
web-page, we also check whether it contains any email
addresses. For this we use the regular expression
\texttt{email\_pattern} in Line~\ref{emailline}. The main
change is in Line~\ref{mainline} where all email addresses
that can be found in a page are printed.\label{crawler3}}
\end{figure}
\begin{figure}[p]
\tiny
\begin{center}
\begin{minipage}{0.8\textwidth}
\lstinputlisting[language={},keywordstyle=\color{black},numbers=none]{../progs/email-rexp}
\end{minipage}
\end{center}
\caption{Nothing much can be said about this regular
expression\ldots{}except that it is a monstrosity.\label{monster}}
\end{figure}
\end{document}
%%% Local Variables:
%%% mode: latex
%%% TeX-master: t
%%% End: