\documentclass{article}
\usepackage{../style}
\usepackage{../langs}
\usepackage{../graphics}
\usepackage{../data}
\usepackage{longtable}
\begin{document}
\section*{Handout 1}
This module is about text processing, be it for web-crawlers,
compilers, dictionaries, DNA-data and so on. When looking for
a particular string in a large text we can use the
Knuth-Morris-Pratt algorithm, which is currently the most
efficient general string search algorithm. But often we do
\emph{not} look for just a particular string, but for string
patterns. For example in programming code we need to identify
what are the keywords, what are the identifiers etc. Also
often we face the problem that we are given a string (for
example some user input) and want to know whether it matches a
particular pattern. For example for excluding some user input
that would otherwise have nasty effects on our program
(crashing or going into an infinite loop, if not worse).
\defn{Regular expressions} help with conveniently specifying
such patterns.
The idea behind regular expressions is that they are a simple
method for describing languages (or sets of strings)...at
least languages we are interested in in computer science. For
example there is no convenient regular expression for
describing the English language short of enumerating all
English words. But they seem useful for describing for example
email addresses.\footnote{See ``8 Regular Expressions You
Should Know'' \url{http://goo.gl/5LoVX7}} Consider the
following regular expression
\begin{equation}\label{email}
\texttt{[a-z0-9\_.-]+ @ [a-z0-9.-]+ . [a-z.]\{2,6\}}
\end{equation}
\noindent where the first part matches one or more lowercase
letters (\pcode{a-z}), digits (\pcode{0-9}), underscores, dots
or hyphens. The \pcode{+} ensures the ``one or more''. Then
comes the \pcode{@}-sign, followed by the domain name which
must be one or more lowercase letters, digits, underscores,
dots or hyphens. Note there cannot be an underscore in the
domain name. Finally there must be a dot followed by the
toplevel domain. This toplevel domain must be 2 to 6 lowercase
letters including the dot. Example strings which follow this
pattern are:
\begin{lstlisting}[language={},numbers=none,keywordstyle=\color{black}]
niceandsimple@example.com
very.common@example.org
a.little.lengthy.but.fine@dept.example.co.uk
other.email-with-dash@example.ac.uk
\end{lstlisting}
\noindent
But for example the following two do not:
\begin{lstlisting}[language={},numbers=none,keywordstyle=\color{black}]
user@localserver
disposable.style.email.with+symbol@example.com
\end{lstlisting}
Many programming language offer libraries that can be used to
validate such strings against regular expressions, like the
one for email addresses in \eqref{email}. There are some
common, and I am sure very familiar, ways how to construct
regular expressions. For example in Scala we have
\begin{center}
\begin{longtable}{lp{9cm}}
\pcode{re*} & matches 0 or more occurrences of preceding
expression\\
\pcode{re+} & matches 1 or more occurrences of preceding
expression\\
\pcode{re?} & matches 0 or 1 occurrence of preceding
expression\\
\pcode{re\{n\}} & matches exactly \pcode{n} number of
occurrences\\
\pcode{re\{n,m\}} & matches at least \pcode{n} and at most {\tt m}
occurences of the preceding expression\\
\pcode{[...]} & matches any single character inside the
brackets\\
\pcode{[^...]} & matches any single character not inside the
brackets\\
\pcode{..-..} & character ranges\\
\pcode{\\d} & matches digits; equivalent to \pcode{[0-9]}
\end{longtable}
\end{center}
\noindent With this you can figure out the purpose of the
regular expressions in the web-crawlers shown Figures
\ref{crawler1}, \ref{crawler2} and \ref{crawler3}. Note the
regular expression for http-addresses in web-pages:
\[
\pcode{"https?://[^"]*"}
\]
\noindent It specifies that web-addresses need to start with a
double quote, then comes \texttt{http} followed by an optional
\texttt{s} and so on. Usually we would have to escape the
double quotes in order to make sure we interpret the double
quote as character, not as double quote for a string. But
Scala's trick with triple quotes allows us to omit this kind
of escaping. As a result we can just write:
\[
\pcode{""""https?://[^"]*"""".r}
\]
\noindent Not also that the convention in Scala is that
\texttt{.r} converts a string into a regular expression. I
leave it to you to ponder whether this regular expression
really captures all possible web-addresses.\bigskip
Regular expressions were introduced by Kleene in the 1950ies
and they have been object of intense study since then. They
are nowadays pretty much ubiquitous in computer science. I am
sure you have come across them before. Why on earth then is
there any interest in studying them again in depth in this
module? Well, one answer is in the following graph about
regular expression matching in Python and in Ruby.
\begin{center}
\begin{tikzpicture}[y=.1cm, x=.15cm]
%axis
\draw (0,0) -- coordinate (x axis mid) (30,0);
\draw (0,0) -- coordinate (y axis mid) (0,30);
%ticks
\foreach \x in {0,5,...,30}
\draw (\x,1pt) -- (\x,-3pt) node[anchor=north] {\x};
\foreach \y in {0,5,...,30}
\draw (1pt,\y) -- (-3pt,\y) node[anchor=east] {\y};
%labels
\node[below=0.6cm] at (x axis mid) {number of \texttt{a}s};
\node[rotate=90,above=0.8cm] at (y axis mid) {time in secs};
%plots
\draw[color=blue] plot[mark=*]
file {re-python.data};
\draw[color=brown] plot[mark=triangle*]
file {re-ruby.data};
%legend
\begin{scope}[shift={(4,20)}]
\draw[color=blue] (0,0) --
plot[mark=*] (0.25,0) -- (0.5,0)
node[right]{\small Python};
\draw[yshift=-\baselineskip, color=brown] (0,0) --
plot[mark=triangle*] (0.25,0) -- (0.5,0)
node[right]{\small Ruby};
\end{scope}
\end{tikzpicture}
\end{center}
\noindent This graph shows that Python needs approximately 29
seconds in order to find out that a string of 28 \texttt{a}s
matches the regular expression \texttt{[a?]\{28\}[a]\{28\}}.
Ruby is even slightly worse.\footnote{In this example Ruby
uses the slightly different regular expression
\texttt{a?a?a?...a?a?aaa...aa}, where the \texttt{a?} and
\texttt{a} each occur $n$ times.} Admittedly, this regular
expression is carefully chosen to exhibit this exponential
behaviour, but similar ones occur more often than one wants in
``real life''. They are sometimes called \emph{evil regular
expressions} because they have the potential to make regular
expression matching engines topple over, like in Python and
Ruby. The problem is that this can have some serious
consequences, for example, if you use them in your
web-application, because hackers can look for these instances
where the matching engine behaves badly and mount a nice
DoS-attack against your application.
It will be instructive to look behind the ``scenes''to find
out why Python and Ruby (and others) behave so badly when
matching with evil regular expressions. But we will also look
at a relatively simple algorithm that solves this problem much
better than Python and Ruby do\ldots actually it will be two
versions of the algorithm: the first one will be able to
process strings of approximately 1,000 \texttt{a}s in 30
seconds, while the second version will even be able to
process up to 12,000 in less than 10(!) seconds, see the graph
below:
\begin{center}
\begin{tikzpicture}[y=.1cm, x=.0006cm]
%axis
\draw (0,0) -- coordinate (x axis mid) (12000,0);
\draw (0,0) -- coordinate (y axis mid) (0,30);
%ticks
\foreach \x in {0,2000,...,12000}
\draw (\x,1pt) -- (\x,-3pt) node[anchor=north] {\x};
\foreach \y in {0,5,...,30}
\draw (1pt,\y) -- (-3pt,\y) node[anchor=east] {\y};
%labels
\node[below=0.6cm] at (x axis mid) {number of \texttt{a}s};
\node[rotate=90, above=0.8cm] at (y axis mid) {time in secs};
%plots
\draw[color=green] plot[mark=square*, mark options={fill=white} ]
file {re2b.data};
\draw[color=black] plot[mark=square*, mark options={fill=white} ]
file {re3.data};
\end{tikzpicture}
\end{center}
\subsection*{Basic Regular Expressions}
The regular expressions shown above we will call
\defn{extended regular expressions}. The ones we will mainly
study are \emph{basic regular expressions}, which by
convention we will just call regular expressions, if it is
clear what we mean. The attraction of (basic) regular
expressions is that many features of the extended one are just
syntactic suggar. (Basic) regular expressions are defined by
the following grammar:
\begin{center}
\begin{tabular}{r@{\hspace{1mm}}r@{\hspace{1mm}}l@{\hspace{13mm}}l}
$r$ & $::=$ & $\varnothing$ & null\\
& $\mid$ & $\epsilon$ & empty string / "" / []\\
& $\mid$ & $c$ & single character\\
& $\mid$ & $r_1 \cdot r_2$ & sequence\\
& $\mid$ & $r_1 + r_2$ & alternative / choice\\
& $\mid$ & $r^*$ & star (zero or more)\\
\end{tabular}
\end{center}
\noindent Because we overload our notation, there are some
subtleties you should be aware of. First, when regular
expressions are referred to then $\varnothing$ does not stand
for the empty set: it is a particular pattern that does not
match any string. Similarly, in the context of regular
expressions, $\epsilon$ does not stand for the empty string
(as in many places in the literature) but for a pattern that
matches the empty string. Second, the letter $c$ stands for
any character from the alphabet at hand. Again in the context
of regular expressions, it is a particular pattern that can
match the specified string. Third, you should also be careful
with the our overloading of the star: assuming you have read
the handout about our basic mathematical notation, you will
see that in the context of languages (sets of strings) the
star stands for an operation on languages. While $r^*$ stands
for a regular expression, the operation on sets is defined as
\[
A^* \dn \bigcup_{0\le n} A^n
\]
We will use parentheses to disambiguate regular expressions.
Parentheses are not really part of a regular expression, and
indeed we do not need them in our code because there the tree
structure is always clear. But for writing them down in a more
mathematical fashion, parentheses will be helpful. For example
we will write $(r_1 + r_2)^*$, which is different from, say
$r_1 + (r_2)^*$. The former means roughly zero or more times
$r_1$ or $r_2$, while the latter means $r_1$ or zero or more
times $r_2$. This will turn out are two different pattern,
which match in general different strings. We should also write
$(r_1 + r_2) + r_3$, which is different from the regular
expression $r_1 + (r_2 + r_3)$, but in case of $+$ and $\cdot$
we actually do not care about the order and just write $r_1 +
r_2 + r_3$, or $r_1 \cdot r_2 \cdot r_3$, respectively. The
reasons for this will become clear shortly. In the literature
you will often find that the choice $r_1 + r_2$ is written as
$r_1\mid{}r_2$ or $r_1\mid\mid{}r_2$. Also following the
convention in the literature, we will often omit the $\cdot$
all together. This is to make some concrete regular
expressions more readable. For example the regular expression
for email addresses shown in \eqref{email} would look like
\[
\texttt{[...]+} \;\cdot\; \texttt{@} \;\cdot\;
\texttt{[...]+} \;\cdot\; \texttt{.} \;\cdot\;
\texttt{[...]\{2,6\}}
\]
\noindent
which is much less readable than \eqref{email}. Similarly for
the regular expression that matches the string $hello$ we
should write
\[
{\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot {\it o}
\]
\noindent
but often just write {\it hello}.
If you prefer to think in terms of the implementation
of regular expressions in Scala, the constructors and
classes relate as follows
\begin{center}
\begin{tabular}{rcl}
$\varnothing$ & $\mapsto$ & \texttt{NULL}\\
$\epsilon$ & $\mapsto$ & \texttt{EMPTY}\\
$c$ & $\mapsto$ & \texttt{CHAR(c)}\\
$r_1 + r_2$ & $\mapsto$ & \texttt{ALT(r1, r2)}\\
$r_1 \cdot r_2$ & $\mapsto$ & \texttt{SEQ(r1, r2)}\\
$r^*$ & $\mapsto$ & \texttt{STAR(r)}
\end{tabular}
\end{center}
A source of confusion might arise from the fact that we
use the term \emph{basic regular expression} for the regular
expressions used in ``theory'' and defined above, and
\emph{extended regular expression} for the ones used in
``practice'', for example Scala. If runtime is not of an
issue, then the latter can be seen as some syntactic sugar of
the former. Fo example we could replace
\begin{center}
\begin{tabular}{rcl}
$r^+$ & $\mapsto$ & $r\cdot r^*$\\
$r?$ & $\mapsto$ & $\epsilon + r$\\
$\backslash d$ & $\mapsto$ & $0 + 1 + 2 + \ldots + 9$\\
$[\text{\it a - z}]$ & $\mapsto$ & $a + b + \ldots + z$\\
\end{tabular}
\end{center}
\subsection*{The Meaning of Regular Expressions}
So far we have only considered informally what the
\emph{meaning} of a regular expression is. This is no good for
specifications of what algorithms are supposed to do or which
problems they are supposed to solve.
To do so more formally we will associate with every regular
expression a language, or set of strings, that is supposed to
be matched by this regular expression. To understand what is
going on here it is crucial that you have also read the
handout about our basic mathematical notations.
The meaning of a regular expression can be defined recursively
as follows
\begin{center}
\begin{tabular}{rcl}
$L(\varnothing)$ & $\dn$ & $\varnothing$\\
$L(\epsilon)$ & $\dn$ & $\{[]\}$\\
$L(c)$ & $\dn$ & $\{"c"\}$\\
$L(r_1+ r_2)$ & $\dn$ & $L(r_1) \cup L(r_2)$\\
$L(r_1 \cdot r_2)$ & $\dn$ & $L(r_1) @ L(r_2)$\\
$L(r^*)$ & $\dn$ & $(L(r))^*$\\
\end{tabular}
\end{center}
\noindent
As a result we can now precisely state what the meaning, for example, of the regular expression
${\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot {\it o}$ is, namely
$L({\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot {\it o}) = \{\text{\it"hello"}\}$...as expected. Similarly if we have the
choice-regular-expression $a + b$, its meaning is $L(a + b) = \{\text{\it"a"}, \text{\it"b"}\}$, namely the only two strings which can possibly
be matched by this choice. You can now also see why we do not make a difference
between the different regular expressions $(r_1 + r_2) + r_3$ and $r_1 + (r_2 + r_3)$....they
are not the same regular expression, but have the same meaning.
The point of the definition of $L$ is that we can use it to precisely specify when a string $s$ is matched by a
regular expression $r$, namely only when $s \in L(r)$. In fact we will write a program {\it match} that takes any string $s$ and
any regular expression $r$ as argument and returns \emph{yes}, if $s \in L(r)$ and \emph{no},
if $s \not\in L(r)$. We leave this for the next lecture.
\begin{figure}[p]
\lstinputlisting{../progs/crawler1.scala}
\caption{The Scala code for a simple web-crawler that checks
for broken links in a web-page. It uses the regular expression
\texttt{http\_pattern} in Line~15 for recognising URL-addresses.
It finds all links using the library function
\texttt{findAllIn} in Line~21.\label{crawler1}}
\end{figure}
\begin{figure}[p]
\lstinputlisting{../progs/crawler2.scala}
\caption{A version of the web-crawler that only follows links
in ``my'' domain---since these are the ones I am interested in
to fix. It uses the regular expression \texttt{my\_urls} in
Line~16 to check for my name in the links. The main change is
in Lines~26--29 where there is a test whether URL is in ``my''
domain or not.\label{crawler2}}
\end{figure}
\begin{figure}[p]
\lstinputlisting{../progs/crawler3.scala}
\caption{A small email harvester---whenever we download a
web-page, we also check whether it contains any email
addresses. For this we use the regular expression
\texttt{email\_pattern} in Line~16. The main change is in Line
32 where all email addresses that can be found in a page are
printed.\label{crawler3}}
\end{figure}
\pagebreak
Lets start
with what we mean by \emph{strings}. Strings (they are also
sometimes referred to as \emph{words}) are lists of characters
drawn from an \emph{alphabet}. If nothing else is specified,
we usually assume the alphabet consists of just the lower-case
letters $a$, $b$, \ldots, $z$. Sometimes, however, we
explicitly restrict strings to contain, for example, only the
letters $a$ and $b$. In this case we say the alphabet is the
set $\{a, b\}$.
There are many ways how we can write down strings. In programming languages, they are usually
written as {\it "hello"} where the double quotes indicate that we dealing with a string.
Essentially, strings are lists of characters which can be written for example as follows
\[
[\text{\it h, e, l, l, o}]
\]
\noindent
The important point is that we can always decompose strings. For example, we will often consider the
first character of a string, say $h$, and the ``rest'' of a string say {\it "ello"} when making definitions
about strings. There are some subtleties with the empty string, sometimes written as {\it ""} but also as
the empty list of characters $[\,]$. Two strings, for example $s_1$ and $s_2$, can be \emph{concatenated},
which we write as $s_1 @ s_2$. Suppose we are given two strings {\it "foo"} and {\it "bar"}, then their concatenation
gives {\it "foobar"}.
We often need to talk about sets of strings. For example the set of all strings over the alphabet $\{a, \ldots\, z\}$
is
\[
\{\text{\it "", "a", "b", "c",\ldots,"z", "aa", "ab", "ac", \ldots, "aaa", \ldots}\}
\]
\noindent
Any set of strings, not just the set-of-all-strings, is often called a \emph{language}. The idea behind
this choice of terminology is that if we enumerate, say, all words/strings from a dictionary, like
\[
\{\text{\it "the", "of", "milk", "name", "antidisestablishmentarianism", \ldots}\}
\]
\noindent
then we have essentially described the English language, or more precisely all
strings that can be used in a sentence of the English language. French would be a
different set of strings, and so on. In the context of this course, a language might
not necessarily make sense from a natural language point of view. For example
the set of all strings shown above is a language, as is the empty set (of strings). The
empty set of strings is often written as $\varnothing$ or $\{\,\}$. Note that there is a
difference between the empty set, or empty language, and the set that
contains only the empty string $\{\text{""}\}$: the former has no elements, whereas
the latter has one element.
Before we expand on the topic of regular expressions, let us review some operations on
sets. We will use capital letters $A$, $B$, $\ldots$ to stand for sets of strings.
The union of two sets is written as usual as $A \cup B$. We also need to define the
operation of \emph{concatenating} two sets of strings. This can be defined as
\[
A @ B \dn \{s_1@ s_2 | s_1 \in A \wedge s_2 \in B \}
\]
\noindent
which essentially means take the first string from the set $A$ and concatenate it with every
string in the set $B$, then take the second string from $A$ do the same and so on. You might
like to think about what this definition means in case $A$ or $B$ is the empty set.
We also need to define
the power of a set of strings, written as $A^n$ with $n$ being a natural number. This is defined inductively as follows
\begin{center}
\begin{tabular}{rcl}
$A^0$ & $\dn$ & $\{[\,]\}$ \\
$A^{n+1}$ & $\dn$ & $A @ A^n$\\
\end{tabular}
\end{center}
\noindent
Finally we need the \emph{star} of a set of strings, written $A^*$. This is defined as the union
of every power of $A^n$ with $n\ge 0$. The mathematical notation for this operation is
\[
A^* \dn \bigcup_{0\le n} A^n
\]
\noindent
This definition implies that the star of a set $A$ contains always the empty string (that is $A^0$), one
copy of every string in $A$ (that is $A^1$), two copies in $A$ (that is $A^2$) and so on. In case $A=\{"a"\}$ we therefore
have
\[
A^* = \{"", "a", "aa", "aaa", \ldots\}
\]
\noindent
Be aware that these operations sometimes have quite non-intuitive properties, for example
\begin{center}
\begin{tabular}{@{}ccc@{}}
\begin{tabular}{@{}r@{\hspace{1mm}}c@{\hspace{1mm}}l}
$A \cup \varnothing$ & $=$ & $A$\\
$A \cup A$ & $=$ & $A$\\
$A \cup B$ & $=$ & $B \cup A$\\
\end{tabular} &
\begin{tabular}{r@{\hspace{1mm}}c@{\hspace{1mm}}l}
$A @ B$ & $\not =$ & $B @ A$\\
$A @ \varnothing$ & $=$ & $\varnothing @ A = \varnothing$\\
$A @ \{""\}$ & $=$ & $\{""\} @ A = A$\\
\end{tabular} &
\begin{tabular}{r@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
$\varnothing^*$ & $=$ & $\{""\}$\\
$\{""\}^*$ & $=$ & $\{""\}$\\
$A^\star$ & $=$ & $\{""\} \cup A\cdot A^*$\\
\end{tabular}
\end{tabular}
\end{center}
\bigskip
\subsection*{My Fascination for Regular Expressions}
\end{document}
%%% Local Variables:
%%% mode: latex
%%% TeX-master: t
%%% End: