handouts/ho01.tex
author Christian Urban <christian dot urban at kcl dot ac dot uk>
Mon, 15 Sep 2014 04:42:48 +0100
changeset 248 ce767ca23244
parent 245 a5fade10c207
child 250 b79e704acb72
permissions -rw-r--r--
updated

\documentclass{article}
\usepackage{../style}
\usepackage{../langs}
\usepackage{../graphics}
\usepackage{../data}


\begin{document}

\section*{Handout 1}

This module is about text processing, be it for web-crawlers,
compilers, dictionaries, DNA-data and so on. When looking for
a particular string in a large text we can use the
Knuth-Morris-Pratt algorithm, which is currently the most
efficient general string search algorithm. But often we do
\emph{not} just look for a particular string, but for string
patterns. For example in programming code we need to identify
what are the keywords, what are the identifiers etc. A pattern
for identifiers could be that they start with a letter,
followed by zero or more letters, numbers and the underscore.
Also often we face the problem that we are given a string (for
example some user input) and want to know whether it matches a
particular pattern. In this way we can exclude user input that
would otherwise have nasty effects on our program (crashing it
or going into an infinite loop, if not worse). \defn{Regular
expressions} help with conveniently specifying such patterns. 
The idea behind regular expressions is that they are a simple
method for describing languages (or sets of strings)\ldots at
least languages we are interested in in computer science. For
example there is no convenient regular expression for
describing the English language short of enumerating all
English words. But they seem useful for describing for example
email addresses.\footnote{See ``8 Regular Expressions You
Should Know'' \url{http://goo.gl/5LoVX7}} Consider the
following regular expression

\begin{equation}\label{email}
\texttt{[a-z0-9\_.-]+ @ [a-z0-9.-]+ . [a-z.]\{2,6\}}
\end{equation}

\noindent where the first part matches one or more lowercase
letters (\pcode{a-z}), digits (\pcode{0-9}), underscores, dots
or hyphens. The \pcode{+} ensures the ``one or more''. Then
comes the \pcode{@}-sign, followed by the domain name which
must be one or more lowercase letters, digits, underscores,
dots or hyphens. Note there cannot be an underscore in the
domain name. Finally there must be a dot followed by the
toplevel domain. This toplevel domain must be 2 to 6 lowercase
letters including the dot. Example strings which follow this
pattern are:

\begin{lstlisting}[language={},numbers=none,keywordstyle=\color{black}]
niceandsimple@example.org
very.common@example.co.uk
a.little.lengthy.but.fine@dept.example.ac.uk
other.email-with-dash@example.edu
\end{lstlisting}


\noindent
But for example the following two do not:

\begin{lstlisting}[language={},numbers=none,keywordstyle=\color{black}]
user@localserver
disposable.style.email.with+symbol@example.com
\end{lstlisting}

Identifiers, or variables, in program text are often required
to satisfy the constraints that they start with a letter and
then can be followed by zero or more letters or numbers and
also can include underscores, but not as the first character.
Such identifiers can be recognised with the regular expression

\begin{center}
\pcode{[a-zA-Z] [a-zA-Z0-9_]*}
\end{center}

\noindent Possible identifiers that match this regular expression 
are \pcode{x}, \pcode{foo}, \pcode{foo_bar_1}, \pcode{A_very_42_long_object_name},
but not \pcode{_i} and also not \pcode{4you}.

Many programming language offer libraries that can be used to
validate such strings against regular expressions. Also there
are some common, and I am sure very familiar, ways of how to
construct regular expressions. For example in Scala we have: 

\begin{center}
\begin{tabular}{lp{9cm}}
\pcode{re*} & matches 0 or more occurrences of preceding 
expression\\
\pcode{re+} & matches 1 or more occurrences of preceding
expression\\
\pcode{re?} &	 matches 0 or 1 occurrence of preceding 
expression\\
\pcode{re\{n\}}	& matches exactly \pcode{n} number of 
occurrences of preceding  expression\\
\pcode{re\{n,m\}} & matches at least \pcode{n} and at most {\tt m}
occurences of the preceding expression\\
\pcode{[...]} & matches any single character inside the 
brackets\\
\pcode{[^...]} & matches any single character not inside the 
brackets\\
\pcode{..-..} & character ranges\\
\pcode{\\d} &	matches digits; equivalent to \pcode{[0-9]}
\end{tabular}
\end{center}

\noindent With this table you can figure out the purpose of
the regular expressions in the web-crawlers shown Figures
\ref{crawler1}, \ref{crawler2} and \ref{crawler3}. Note,
however, the regular expression for http-addresses in
web-pages is meant to be

\[
\pcode{"https?://[^"]*"}
\]

\noindent It specifies that web-addresses need to start with a
double quote, then comes \texttt{http} followed by an optional
\texttt{s} and so on. Usually we would have to escape the
double quotes in order to make sure we interpret the double
quote as character, not as double quote for a string. But
Scala's trick with triple quotes allows us to omit this kind
of escaping. As a result we can just write:

\[
\pcode{""""https?://[^"]*"""".r}
\]

\noindent Note also that the convention in Scala is that
\texttt{.r} converts a string into a regular expression. I
leave it to you to ponder whether this regular expression
really captures all possible web-addresses.

\subsection*{Why Study Regular Expressions?}

Regular expressions were introduced by Kleene in the 1950ies
and they have been object of intense study since then. They
are nowadays pretty much ubiquitous in computer science. I am
sure you have come across them before. Why on earth then is
there any interest in studying them again in depth in this
module? Well, one answer is in the following graph about
regular expression matching in Python and in Ruby.

\begin{center}
\begin{tikzpicture}[y=.1cm, x=.15cm]
 	%axis
	\draw (0,0) -- coordinate (x axis mid) (30,0);
   \draw (0,0) -- coordinate (y axis mid) (0,30);
   %ticks
   \foreach \x in {0,5,...,30}
     \draw (\x,1pt) -- (\x,-3pt) node[anchor=north] {\x};
   \foreach \y in {0,5,...,30}
     \draw (1pt,\y) -- (-3pt,\y) node[anchor=east] {\y}; 
	%labels      
	\node[below=0.6cm] at (x axis mid) {number of \texttt{a}s};
	\node[rotate=90,above=0.8cm] at (y axis mid) {time in secs};
	%plots
	\draw[color=blue] plot[mark=*] 
		file {re-python.data};
	\draw[color=brown] plot[mark=triangle*] 
		file {re-ruby.data};	 
   %legend
	\begin{scope}[shift={(4,20)}] 
	\draw[color=blue] (0,0) -- 
		plot[mark=*] (0.25,0) -- (0.5,0) 
		node[right]{\small Python};
	\draw[yshift=-\baselineskip, color=brown] (0,0) -- 
		plot[mark=triangle*] (0.25,0) -- (0.5,0)
		node[right]{\small Ruby};
	\end{scope}	
\end{tikzpicture}
\end{center}

\noindent This graph shows that Python needs approximately 29
seconds for finding out whether a string of 28 \texttt{a}s
matches the regular expression \texttt{[a?]\{28\}[a]\{28\}}.
Ruby is even slightly worse.\footnote{In this example Ruby
uses the slightly different regular expression
\texttt{a?a?a?...a?a?aaa...aa}, where the \texttt{a?} and
\texttt{a} each occur $n$ times.} Admittedly, this regular
expression is carefully chosen to exhibit this exponential
behaviour, but similar ones occur more often than one wants in
``real life''. They are sometimes called \emph{evil regular
expressions} because they have the potential to make regular
expression matching engines to topple over, like in Python and
Ruby. The problem with evil regular expressions is that they
can have some serious consequences, for example, if you use
them in your web-application. The reason is that hackers can
look for these instances where the matching engine behaves
badly and then mount a nice DoS-attack against your
application.

It will be instructive to look behind the ``scenes'' to find
out why Python and Ruby (and others) behave so badly when
matching with evil regular expressions. But we will also look
at a relatively simple algorithm that solves this problem much
better than Python and Ruby do\ldots actually it will be two
versions of the algorithm: the first one will be able to
process strings of approximately 1,000 \texttt{a}s in 30
seconds, while the second version will even be able to process
up to 12,000 in less than 10(!) seconds, see the graph below:

\begin{center}
\begin{tikzpicture}[y=.1cm, x=.0006cm]
 	%axis
	\draw (0,0) -- coordinate (x axis mid) (12000,0);
   \draw (0,0) -- coordinate (y axis mid) (0,30);
   %ticks
   \foreach \x in {0,2000,...,12000}
     	\draw (\x,1pt) -- (\x,-3pt) node[anchor=north] {\x};
   \foreach \y in {0,5,...,30}
     	\draw (1pt,\y) -- (-3pt,\y) node[anchor=east] {\y}; 
	%labels      
	\node[below=0.6cm] at (x axis mid) {number of \texttt{a}s};
	\node[rotate=90, above=0.8cm] at (y axis mid) {time in secs};

	%plots
   \draw[color=green] plot[mark=square*, mark options={fill=white} ] 
		file {re2b.data};
	\draw[color=black] plot[mark=square*, mark options={fill=white} ] 
		file {re3.data};	 
\end{tikzpicture}
\end{center}

\subsection*{Basic Regular Expressions}

The regular expressions shown above, for example for Scala, we
will call \emph{extended regular expressions}. The ones we
will mainly study in this module are \emph{basic regular
expressions}, which by convention we will just call
\emph{regular expressions}, if it is clear what we mean. The
attraction of (basic) regular expressions is that many
features of the extended ones are just syntactic sugar.
(Basic) regular expressions are defined by the following
grammar:

\begin{center}
\begin{tabular}{r@{\hspace{1mm}}r@{\hspace{1mm}}l@{\hspace{13mm}}l}
  $r$ & $::=$ &   $\varnothing$         & null\\
        & $\mid$ & $\epsilon$           & empty string / "" / []\\
        & $\mid$ & $c$                  & single character\\
        & $\mid$ & $r_1 + r_2$          & alternative / choice\\
        & $\mid$ & $r_1 \cdot r_2$      & sequence\\
        & $\mid$ & $r^*$                & star (zero or more)\\
  \end{tabular}
\end{center}

\noindent Because we overload our notation, there are some
subtleties you should be aware of. When regular expressions
are referred to then $\varnothing$ does not stand for the
empty set: rather it is a particular pattern that does not
match any string. Similarly, in the context of regular
expressions, $\epsilon$ does not stand for the empty string
(as in many places in the literature) but for a regular
expression that matches the empty string. The letter $c$
stands for any character from the alphabet at hand. Again in
the context of regular expressions, it is a particular pattern
that can match the specified character. You should also be
careful with our overloading of the star: assuming you have
read the handout about our basic mathematical notation, you
will see that in the context of languages (sets of strings)
the star stands for an operation on languages. Here $r^*$
stands for a regular expression, which is different from the
operation on sets is defined as

\[
A^* \dn \bigcup_{0\le n} A^n
\]

We will use parentheses to disambiguate regular expressions.
Parentheses are not really part of a regular expression, and
indeed we do not need them in our code because there the tree
structure of regular expressions is always clear. But for
writing them down in a more mathematical fashion, parentheses
will be helpful. For example we will write $(r_1 + r_2)^*$,
which is different from, say $r_1 + (r_2)^*$. The former means
roughly zero or more times $r_1$ or $r_2$, while the latter
means $r_1$ or zero or more times $r_2$. This will turn out to
be two different patterns, which match in general different
strings. We should also write $(r_1 + r_2) + r_3$, which is
different from the regular expression $r_1 + (r_2 + r_3)$, but
in case of $+$ and $\cdot$ we actually do not care about the
order and just write $r_1 + r_2 + r_3$, or $r_1 \cdot r_2
\cdot r_3$, respectively. The reasons for this will become
clear shortly. In the literature you will often find that the
choice $r_1 + r_2$ is written as $r_1\mid{}r_2$ or
$r_1\mid\mid{}r_2$. Also following the convention in the
literature, we will often omit the $\cdot$ all together. This
is to make some concrete regular expressions more readable.
For example the regular expression for email addresses shown
in \eqref{email} would look like

\[
\texttt{[...]+} \;\cdot\;  \texttt{@} \;\cdot\; 
\texttt{[...]+} \;\cdot\; \texttt{.} \;\cdot\; 
\texttt{[...]\{2,6\}}
\]

\noindent
which is much less readable than \eqref{email}. Similarly for
the regular expression that matches the string $hello$ we 
should write

\[
h \cdot e \cdot l \cdot l \cdot o
\]

\noindent
but often just write {\it hello}.

If you prefer to think in terms of the implementation
of regular expressions in Scala, the constructors and
classes relate as follows\footnote{More about Scala is 
in the handout about A Crash-Course on Scala.}

\begin{center}
\begin{tabular}{rcl}
$\varnothing$ & $\mapsto$ & \texttt{NULL}\\
$\epsilon$    & $\mapsto$ & \texttt{EMPTY}\\
$c$           & $\mapsto$ & \texttt{CHAR(c)}\\
$r_1 + r_2$   & $\mapsto$ & \texttt{ALT(r1, r2)}\\
$r_1 \cdot r_2$ & $\mapsto$ & \texttt{SEQ(r1, r2)}\\
$r^*$         & $\mapsto$ & \texttt{STAR(r)}
\end{tabular}
\end{center}

A source of confusion might arise from the fact that we
use the term \emph{basic regular expression} for the regular
expressions used in ``theory'' and defined above, and
\emph{extended regular expression} for the ones used in
``practice'', for example in Scala. If runtime is not an
issue, then the latter can be seen as syntactic sugar of
the former. For example we could replace

\begin{center}
\begin{tabular}{rcl}
$r+$ & $\mapsto$ & $r\cdot r^*$\\
$r?$ & $\mapsto$ & $\epsilon + r$\\
$\backslash d$ & $\mapsto$ & $0 + 1 + 2 + \ldots + 9$\\
$[\text{\it a - z}]$ & $\mapsto$ & $a + b + \ldots + z$\\
\end{tabular}
\end{center}


\subsection*{The Meaning of Regular Expressions}

So far we have only considered informally what the
\emph{meaning} of a regular expression is. This is not good
enough for specifications of what algorithms are supposed to
do or which problems they are supposed to solve.

To define the meaning of a regular expression we will
associate with every regular expression a language, or set of
strings. This language contains all the strings the regular
expression is supposed to match. To understand what is going
on here it is crucial that you have read the handout
about basic mathematical notations.

The \defn{meaning of a regular expression} can be defined
by a recursive function called $L$ (for language), which
is defined as follows

\begin{center}
\begin{tabular}{rcl}
$L(\varnothing)$  & $\dn$ & $\varnothing$\\
$L(\epsilon)$     & $\dn$ & $\{[]\}$\\
$L(c)$            & $\dn$ & $\{"c"\}$\\
$L(r_1+ r_2)$     & $\dn$ & $L(r_1) \cup L(r_2)$\\
$L(r_1 \cdot r_2)$ & $\dn$ & $L(r_1) \,@\, L(r_2)$\\
$L(r^*)$           & $\dn$ & $(L(r))^*$\\
\end{tabular}
\end{center}

\noindent As a result we can now precisely state what the
meaning, for example, of the regular expression $h \cdot
e \cdot l \cdot l \cdot o$ is, namely

\[
L(h \cdot e \cdot  l \cdot l \cdot o) = \{"hello"\}
\]

\noindent This is expected because this regular expression 
is only supposed to match the ``$hello$''-string. Similarly if
we have the choice-regular-expression $a + b$, its meaning is

\[
L(a + b) = \{"a", "b"\}
\]

\noindent You can now also see why we do not make a difference
between the different regular expressions $(r_1 + r_2) + r_3$
and $r_1 + (r_2 + r_3)$\ldots they are not the same regular
expression, but they have the same meaning. For example

\begin{eqnarray*}
L((r_1 + r_2) + r_3) & = & L(r_1 + r_2) \cup L(r_3)\\
                     & = & L(r_1) \cup L(r_2) \cup L(r_3)\\
                     & = & L(r_1) \cup L(r_2 + r_3)\\
                     & = & L(r_1 + (r_2 + r_3))
\end{eqnarray*}

The point of the definition of $L$ is that we can use it to
precisely specify when a string $s$ is matched by a regular
expression $r$, namely if and only if $s \in L(r)$. In fact we
will write a program \pcode{match} that takes any string $s$
and any regular expression $r$ as argument and returns
\emph{yes}, if $s \in L(r)$ and \emph{no}, if $s \not\in
L(r)$. We leave this for the next lecture.

There is one more feature of regular expressions that is worth
mentioning. Given some strings, there are in general many
different regular expressions that can recognise these
strings. This is obvious with the regular expression $a + b$
which can match the strings $a$ and $b$. But also the regular
expression $b + a$ would match the same strings. However,
sometimes it is not so obvious whether two regular expressions
match the same strings: for example do $r^*$ and $\epsilon + r
\cdot r^*$ match the same strings? What about $\varnothing^*$
and $\epsilon^*$? This suggests the following relation between
\defn{equivalent regular expressions}: 

\[
r_1 \equiv r_2 \;\dn\; L(r_1) = L(r_2)
\]

\noindent That means two regular expressions are said to be
equivalent if they match the same set of strings. Therefore we
do not really distinguish between the different regular
expression $(r_1 + r_2) + r_3$ and $r_1 + (r_2 + r_3)$,
because they are equivalent. I leave you to the question
whether

\[
\varnothing^* \equiv \epsilon^*
\]

\noindent holds. Such equivalences will be important for our
matching algorithm, because we can use them to simplify 
regular expressions. 

\subsection*{My Fascination for Regular Expressions}

Up until a few years ago I was not really interested in
regular expressions. They have been studied for the last 60
years (by smarter people than me)---surely nothing new can be
found out about them. I even have the vague recollection that
I did not quite understand them during my study. If I remember
correctly,\footnote{That was really a long time ago.} I got
utterly confused about $\epsilon$ and the empty
string.\footnote{Obviously the lecturer must have been bad.}
Since my study, I have used regular expressions for
implementing lexers and parsers as I have always been
interested in all kinds of programming languages and
compilers, which invariably need regular expression in some
form or shape. 

To understand my fascination nowadays with regular
expressions, you need to know that my main scientific interest
for the last 14 years has been with theorem provers. I am a
core developer of one of
them.\footnote{\url{http://isabelle.in.tum.de}} Theorem
provers are systems in which you can formally reason about
mathematical concepts, but also about programs. In this way
they can help with writing bug-free code. Theorem provers have
proved already their value in a number of systems (even in
terms of hard cash), but they are still clunky and difficult
to use by average programmers.

Anyway, in about 2011 I came across the notion of
\defn{derivatives of regular expressions}. This notion allows
one to do almost all calculations in regular language theory
on the level of regular expressions, not needing any automata.
This is important because automata are graphs and it is rather
difficult to reason about graphs in theorem provers. In
contrast, to reason about regular expressions is easy-peasy in
theorem provers. Is this important? I think yes, because
according to Kuklewicz nearly all POSIX-based regular
expression matchers are
buggy.\footnote{\url{http://www.haskell.org/haskellwiki/Regex_Posix}}
With my PhD student Fahad Ausaf I am currently working on
proving the correctness for one such matcher that was
proposed by Sulzmann and Lu in
2014.\footnote{\url{http://goo.gl/bz0eHp}} This would be an
attractive results since we will be able to prove that the
algorithm is really correct, but also that the machine code(!)
that implements this code efficiently is correct. Writing
programs in this way does not leave any room for potential
errors or bugs. How nice!

What also helped with my fascination with regular expressions
is that we could indeed find out new things about them that
have surprised some experts in the field of regular
expressions. Together with two colleagues from China, I was
able to prove the Myhill-Nerode theorem by only using regular
expressions and the notion of derivatives. Earlier versions of
this theorem used always automata in the proof. Using this
theorem we can show that regular languages are closed under
complementation, something which Gasarch in his
blog\footnote{\url{http://goo.gl/2R11Fw}} assumed can only be
shown via automata. Even sombody who has written a 700+-page
book\footnote{\url{http://goo.gl/fD0eHx}} on regular
exprssions did not know better. Well, we showed it can also be
done with regular expressions only.\footnote{\url{http://www.inf.kcl.ac.uk/staff/urbanc/Publications/rexp.pdf}} 
What a feeling if you are an outsider to the subject!

To conclude: Despite my early ignorance about regular
expressions, I find them now quite interesting. They have a
beautiful mathematical theory behind them. They have practical
importance (remember the shocking runtime of the regular
expression matchers in Python and Ruby in some instances).
People who are not very familiar with the mathematical
background of regular expressions get them consistently wrong.
The hope is that we can do better in the future---for example
by proving that the algorithms actually satisfy their
specification and that the corresponding implementations do
not contain any bugs. We are close, but not yet quite there.

Despite my fascination, I am also happy to admit that regular
expressions have their shortcomings. There are some well-known
``theoretical'' shortcomings, for example recognising strings
of the form $a^{n}b^{n}$. I am not so bothered by them. What I
am bothered about is when regular expressions are in the way
of practical programming. For example, it turns out that the
regular expression for email addresses shown in \eqref{email}
is hopelessly inadequate for recognising all of them (despite
being touted as something every computer scientist should know
about). The W3 Consortium (which standardises the Web)
proposes to use the following, already more complicated
regular expressions for email addresses:

{\small\begin{lstlisting}[language={},keywordstyle=\color{black},numbers=none]
[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*
\end{lstlisting}}

\noindent But they admit that by using this regular expression
they wilfully violate the RFC 5322 standard which specifies
the syntax of email addresses. With their proposed regular
expression they are too strict in some cases and too lax in
others. Not a good situation to be in. A regular expression
that is claimed to be closer to the standard is shown in
Figure~\ref{monster}. Whether this claim is true or not, I
would not know---the only thing I can say to this regular
expression is it is a monstrosity. However, this might
actually be an argument against the RFC standard, rather than
against regular expressions. Still it is good to know that
some tasks in text processing just cannot be achieved by using
regular expressions.


\begin{figure}[p]
\lstinputlisting{../progs/crawler1.scala}
\caption{The Scala code for a simple web-crawler that checks
for broken links in a web-page. It uses the regular expression
\texttt{http\_pattern} in Line~15 for recognising URL-addresses.
It finds all links using the library function
\texttt{findAllIn} in Line~21.\label{crawler1}}
\end{figure}


\begin{figure}[p]
\lstinputlisting{../progs/crawler2.scala}

\caption{A version of the web-crawler that only follows links
in ``my'' domain---since these are the ones I am interested in
to fix. It uses the regular expression \texttt{my\_urls} in
Line~16 to check for my name in the links. The main change is
in Lines~26--29 where there is a test whether URL is in ``my''
domain or not.\label{crawler2}}

\end{figure}

\begin{figure}[p]
\lstinputlisting{../progs/crawler3.scala}

\caption{A small email harvester---whenever we download a
web-page, we also check whether it contains any email
addresses. For this we use the regular expression
\texttt{email\_pattern} in Line~16. The main change is in Line
32 where all email addresses that can be found in a page are
printed.\label{crawler3}}
\end{figure}

\begin{figure}[p]
\tiny
\begin{center}
\begin{minipage}{0.8\textwidth}
\lstinputlisting[language={},keywordstyle=\color{black},numbers=none]{../progs/email-rexp}
\end{minipage}
\end{center}

\caption{Nothing that can be said\ldots\label{monster}}
\end{figure}


\end{document}

%%% Local Variables: 
%%% mode: latex
%%% TeX-master: t
%%% End: