\documentclass{article}\usepackage{../style}\usepackage{../langs}\usepackage{../graphics}\usepackage{../data}%%http://regexcrossword.com/challenges/cities/puzzles/1\begin{document}\section*{Handout 1}This module is about text processing, be it for web-crawlers,compilers, dictionaries, DNA-data and so on. When looking fora particular string in a large text we can use theKnuth-Morris-Pratt algorithm, which is currently the mostefficient general string search algorithm. But often we do\emph{not} just look for a particular string, but for stringpatterns. For example in program code we need to identifywhat are the keywords, what are the identifiers etc. A patternfor identifiers could be stated as: they start with a letter,followed by zero or more letters, numbers and underscores.Also often we face the problem that we are given a string (forexample some user input) and want to know whether it matches aparticular pattern. In this way we can exclude user input thatwould otherwise have nasty effects on our program (crashing itor making it go into an infinite loop, if not worse).\defn{Regular expressions} help with conveniently specifyingsuch patterns. The idea behind regular expressions is thatthey are a simple method for describing languages (or sets ofstrings)\ldots at least languages we are interested in incomputer science. For example there is no convenient regularexpression for describing the English language short ofenumerating all English words. But they seem useful fordescribing for example email addresses.\footnote{See ``8Regular Expressions You Should Know''\url{http://goo.gl/5LoVX7}} Consider the following regularexpression\begin{equation}\label{email}\texttt{[a-z0-9\_.-]+ @ [a-z0-9.-]+ . [a-z.]\{2,6\}}\end{equation}\noindent where the first part matches one or more lowercaseletters (\pcode{a-z}), digits (\pcode{0-9}), underscores, dotsor hyphens. The \pcode{+} ensures the ``one or more''. Thencomes the \pcode{@}-sign, followed by the domain name whichmust be one or more lowercase letters, digits, underscores,dots or hyphens. Note there cannot be an underscore in thedomain name. Finally there must be a dot followed by thetoplevel domain. This toplevel domain must be 2 to 6 lowercaseletters including the dot. Example strings which follow thispattern are:\begin{lstlisting}[language={},numbers=none,keywordstyle=\color{black}]niceandsimple@example.orgvery.common@example.co.uka.little.lengthy.but.fine@dept.example.ac.ukother.email-with-dash@example.edu\end{lstlisting}\noindentBut for example the following two do not:\begin{lstlisting}[language={},numbers=none,keywordstyle=\color{black}]user@localserverdisposable.style.email.with+symbol@example.com\end{lstlisting}As mentioned above, identifiers, or variables, in program codeare often required to satisfy the constraints that they startwith a letter and then can be followed by zero or more lettersor numbers and also can include underscores, but not as thefirst character. Such identifiers can be recognised with theregular expression\begin{center}\pcode{[a-zA-Z] [a-zA-Z0-9_]*}\end{center}\noindent Possible identifiers that match this regular expression are \pcode{x}, \pcode{foo}, \pcode{foo_bar_1}, \pcode{A_very_42_long_object_name},but not \pcode{_i} and also not \pcode{4you}.Many programming language offer libraries that can be used tovalidate such strings against regular expressions. Also thereare some common, and I am sure very familiar, ways of how toconstruct regular expressions. For example in Scala we have: \begin{center}\begin{tabular}{lp{9cm}}\pcode{re*} & matches 0 or more occurrences of preceding expression\\\pcode{re+} & matches 1 or more occurrences of precedingexpression\\\pcode{re?} & matches 0 or 1 occurrence of preceding expression\\\pcode{re\{n\}} & matches exactly \pcode{n} number of occurrences of preceding expression\\\pcode{re\{n,m\}} & matches at least \pcode{n} and at most {\tt m}occurences of the preceding expression\\\pcode{[...]} & matches any single character inside the brackets\\\pcode{[^...]} & matches any single character not inside the brackets\\\pcode{...-...} & character ranges\\\pcode{\\d} & matches digits; equivalent to \pcode{[0-9]}\\\pcode{.} & matches every character except newline\\\pcode{(re)} & groups regular expressions and remembers matched text\end{tabular}\end{center}\noindent With this table you can figure out the purpose ofthe regular expressions in the web-crawlers shown Figures\ref{crawler1}, \ref{crawler2} and\ref{crawler3}.\footnote{There is an interesting twist in theweb-scraper where \pcode{re*?} is used instead of\pcode{re*}.} Note, however, the regular expression forhttp-addresses in web-pages in Figure~\ref{crawler1}, Line 15,is intended to be\[\pcode{"https?://[^"]*"}\]\noindent It specifies that web-addresses need to start with adouble quote, then comes \texttt{http} followed by an optional\texttt{s} and so on until the closing double quote comes.Usually we would have to escape the double quotes in order tomake sure we interpret the double quote as character, not asdouble quote for a string. But Scala's trick with triplequotes allows us to omit this kind of escaping. As a result wecan just write:\[\pcode{""""https?://[^"]*"""".r}\]\noindent Note also that the convention in Scala is that\texttt{.r} converts a string into a regular expression. Ileave it to you to ponder whether this regular expressionreally captures all possible web-addresses.\subsection*{Why Study Regular Expressions?}Regular expressions were introduced by Kleene in the 1950iesand they have been object of intense study since then. Theyare nowadays pretty much ubiquitous in computer science. I amsure you have come across them before. Why on earth then isthere any interest in studying them again in depth in thismodule? Well, one answer is in the following graph aboutregular expression matching in Python and in Ruby.\begin{center}\begin{tikzpicture}\begin{axis}[xlabel={\pcode{a}s},ylabel={time in secs}, enlargelimits=false, xtick={0,5,...,30}, xmax=33, ymax=35, ytick={0,5,...,30}, scaled ticks=false, axis lines=left, width=7cm, height=4.5cm, legend entries={Python,Ruby}, legend pos=north west, legend cell align=left]\addplot[blue,mark=*, mark options={fill=white}] table {re-python.data};\addplot[brown,mark=triangle*, mark options={fill=white}] table {re-ruby.data}; \end{axis}\end{tikzpicture}\end{center}\noindent This graph shows that Python needs approximately 29seconds for finding out whether a string of 28 \texttt{a}smatches the regular expression \texttt{[a?]\{28\}[a]\{28\}}.Ruby is even slightly worse.\footnote{In this example Rubyuses the slightly different regular expression\texttt{a?a?a?...a?a?aaa...aa}, where the \texttt{a?} and\texttt{a} each occur $n$ times. More test results can befound at \url{http://www.computerbytesman.com/redos/}.}Admittedly, this regular expression is carefully chosen toexhibit this exponential behaviour, but similar ones occurmore often than one wants in ``real life''. They are sometimescalled \emph{evil regular expressions} because they have thepotential to make regular expression matching engines totopple over, like in Python and Ruby. The problem with evilregular expressions is that they can have some seriousconsequences, for example, if you use them in yourweb-application. The reason is that hackers can look for theseinstances where the matching engine behaves badly and thenmount a nice DoS-attack against your application. Theseattacks are already have their own name: \emph{Regular Expression Denial of Service Attacks (ReDoS)}.It will be instructive to look behind the ``scenes'' to findout why Python and Ruby (and others) behave so badly whenmatching with evil regular expressions. But we will also lookat a relatively simple algorithm that solves this problem muchbetter than Python and Ruby do\ldots actually it will be twoversions of the algorithm: the first one will be able toprocess strings of approximately 1,000 \texttt{a}s in 30seconds, while the second version will even be able to processup to 12,000 in less than 10(!) seconds, see the graph below:\begin{center}\begin{tikzpicture} \begin{axis}[xlabel={\pcode{a}s},ylabel={time in secs}, enlargelimits=false, xtick={0,3000,...,12000}, xmax=12500, ymax=35, ytick={0,5,...,30}, scaled ticks=false, axis lines=left, width=9cm, height=4.5cm]\addplot[green,mark=square*,mark options={fill=white}] table {re2b.data};\addplot[black,mark=square*,mark options={fill=white}] table {re3.data};\end{axis}\end{tikzpicture}\end{center}\subsection*{Basic Regular Expressions}The regular expressions shown above for Scala, wewill call \emph{extended regular expressions}. The ones wewill mainly study in this module are \emph{basic regularexpressions}, which by convention we will just call\emph{regular expressions}, if it is clear what we mean. Theattraction of (basic) regular expressions is that manyfeatures of the extended ones are just syntactic sugar.(Basic) regular expressions are defined by the followinggrammar:\begin{center}\begin{tabular}{r@{\hspace{1mm}}r@{\hspace{1mm}}l@{\hspace{13mm}}l} $r$ & $::=$ & $\varnothing$ & null\\ & $\mid$ & $\epsilon$ & empty string / \texttt{""} / []\\ & $\mid$ & $c$ & single character\\ & $\mid$ & $r_1 + r_2$ & alternative / choice\\ & $\mid$ & $r_1 \cdot r_2$ & sequence\\ & $\mid$ & $r^*$ & star (zero or more)\\ \end{tabular}\end{center}\noindent Because we overload our notation, there are somesubtleties you should be aware of. When regular expressionsare referred to then $\varnothing$ does not stand for theempty set: rather it is a particular pattern that does notmatch any string. Similarly, in the context of regularexpressions, $\epsilon$ does not stand for the empty string(as in many places in the literature) but for a regularexpression that matches the empty string. The letter $c$stands for any character from the alphabet at hand. Again inthe context of regular expressions, it is a particular patternthat can match the specified character. You should also becareful with our overloading of the star: assuming you haveread the handout about our basic mathematical notation, youwill see that in the context of languages (sets of strings)the star stands for an operation on languages. Here $r^*$stands for a regular expression, which is different from theoperation on sets is defined as\[A^* \dn \bigcup_{0\le n} A^n\]\noindent Note that this expands to\[A^* \dn A^0 \cup A^1 \cup A^2 \cup A^3 \cup A^4 \cup \ldots\]\noindent which is equivalent to\[A^* \dn \{[]\} \cup A \cup A@A \cup A@A@A \cup A@A@A@A \cup \ldots\]\noindent Remember that $A^0$ is always the set containing the empty string.We will use parentheses to disambiguate regular expressions.Parentheses are not really part of a regular expression, andindeed we do not need them in our code because there the treestructure of regular expressions is always clear. But forwriting them down in a more mathematical fashion, parentheseswill be helpful. For example we will write $(r_1 + r_2)^*$,which is different from, say $r_1 + (r_2)^*$. The former meansroughly zero or more times $r_1$ or $r_2$, while the lattermeans $r_1$ or zero or more times $r_2$. This will turn out tobe two different patterns, which match in general differentstrings. We should also write $(r_1 + r_2) + r_3$, which isdifferent from the regular expression $r_1 + (r_2 + r_3)$, butin case of $+$ and $\cdot$ we actually do not care about theorder and just write $r_1 + r_2 + r_3$, or $r_1 \cdot r_2\cdot r_3$, respectively. The reasons for this will becomeclear shortly. In the literature you will often find that thechoice $r_1 + r_2$ is written as $r_1\mid{}r_2$ or$r_1\mid\mid{}r_2$. Also following the convention in theliterature, we will often omit the $\cdot$ all together. Thisis to make some concrete regular expressions more readable.For example the regular expression for email addresses shownin \eqref{email} would look like\[\texttt{[...]+} \;\cdot\; \texttt{@} \;\cdot\; \texttt{[...]+} \;\cdot\; \texttt{.} \;\cdot\; \texttt{[...]\{2,6\}}\]\noindentwhich is much less readable than \eqref{email}. Similarly forthe regular expression that matches the string $hello$ we should write\[h \cdot e \cdot l \cdot l \cdot o\]\noindentbut often just write {\it hello}.If you prefer to think in terms of the implementationof regular expressions in Scala, the constructors andclasses relate as follows\footnote{More about Scala is in the handout about A Crash-Course on Scala.}\begin{center}\begin{tabular}{rcl}$\varnothing$ & $\mapsto$ & \texttt{NULL}\\$\epsilon$ & $\mapsto$ & \texttt{EMPTY}\\$c$ & $\mapsto$ & \texttt{CHAR(c)}\\$r_1 + r_2$ & $\mapsto$ & \texttt{ALT(r1, r2)}\\$r_1 \cdot r_2$ & $\mapsto$ & \texttt{SEQ(r1, r2)}\\$r^*$ & $\mapsto$ & \texttt{STAR(r)}\end{tabular}\end{center}A source of confusion might arise from the fact that weuse the term \emph{basic regular expression} for the regularexpressions used in ``theory'' and defined above, and\emph{extended regular expression} for the ones used in``practice'', for example in Scala. If runtime is not anissue, then the latter can be seen as syntactic sugar ofthe former. For example we could replace\begin{center}\begin{tabular}{rcl}$r+$ & $\mapsto$ & $r\cdot r^*$\\$r?$ & $\mapsto$ & $\epsilon + r$\\$\backslash d$ & $\mapsto$ & $0 + 1 + 2 + \ldots + 9$\\$[\text{\it a - z}]$ & $\mapsto$ & $a + b + \ldots + z$\\\end{tabular}\end{center}\subsection*{The Meaning of Regular Expressions}So far we have only considered informally what the\emph{meaning} of a regular expression is. This is not goodenough for specifications of what algorithms are supposed todo or which problems they are supposed to solve.To define the meaning of a regular expression we willassociate with every regular expression a language, or set ofstrings. This language contains all the strings the regularexpression is supposed to match. To understand what is goingon here it is crucial that you have read the handoutabout basic mathematical notations.The \defn{meaning of a regular expression} can be definedby a recursive function called $L$ (for language), whichis defined as follows\begin{center}\begin{tabular}{rcl}$L(\varnothing)$ & $\dn$ & $\{\}$\\$L(\epsilon)$ & $\dn$ & $\{[]\}$\\$L(c)$ & $\dn$ & $\{[c]\}$\\$L(r_1+ r_2)$ & $\dn$ & $L(r_1) \cup L(r_2)$\\$L(r_1 \cdot r_2)$ & $\dn$ & $L(r_1) \,@\, L(r_2)$\\$L(r^*)$ & $\dn$ & $(L(r))^*$\\\end{tabular}\end{center}\noindent As a result we can now precisely state what themeaning, for example, of the regular expression $h \cdote \cdot l \cdot l \cdot o$ is, namely\[L(h \cdot e \cdot l \cdot l \cdot o) = \{"hello"\}\]\noindent This is expected because this regular expression is only supposed to match the ``$hello$''-string. Similarly ifwe have the choice-regular-expression $a + b$, its meaning is\[L(a + b) = \{"a", "b"\}\]\noindent You can now also see why we do not make a differencebetween the different regular expressions $(r_1 + r_2) + r_3$and $r_1 + (r_2 + r_3)$\ldots they are not the same regularexpression, but they have the same meaning. For exampleyou can do the following calculation which shows theyhave the same meaning:\begin{eqnarray*}L((r_1 + r_2) + r_3) & = & L(r_1 + r_2) \cup L(r_3)\\ & = & L(r_1) \cup L(r_2) \cup L(r_3)\\ & = & L(r_1) \cup L(r_2 + r_3)\\ & = & L(r_1 + (r_2 + r_3))\end{eqnarray*}The point of the definition of $L$ is that we can use it toprecisely specify when a string $s$ is matched by a regularexpression $r$, namely if and only if $s \in L(r)$. In fact wewill write a program \pcode{match} that takes any string $s$and any regular expression $r$ as argument and returns\emph{yes}, if $s \in L(r)$ and \emph{no}, if $s \not\inL(r)$. We leave this for the next lecture.There is one more feature of regular expressions that is worthmentioning. Given some strings, there are in general manydifferent regular expressions that can recognise thesestrings. This is obvious with the regular expression $a + b$which can match the strings $a$ and $b$. But also the regularexpression $b + a$ would match the same strings. However,sometimes it is not so obvious whether two regular expressionsmatch the same strings: for example do $r^*$ and $\epsilon + r\cdot r^*$ match the same strings? What about $\varnothing^*$and $\epsilon^*$? This suggests the following relation between\defn{equivalent regular expressions}: \[r_1 \equiv r_2 \;\dn\; L(r_1) = L(r_2)\]\noindent That means two regular expressions are said to beequivalent if they match the same set of strings. Therefore wedo not really distinguish between the different regularexpression $(r_1 + r_2) + r_3$ and $r_1 + (r_2 + r_3)$,because they are equivalent. I leave you to the questionwhether\[\varnothing^* \equiv \epsilon^*\]\noindent holds? Such equivalences will be important for ourmatching algorithm, because we can use them to simplify regular expressions. \subsection*{My Fascination for Regular Expressions}Up until a few years ago I was not really interested inregular expressions. They have been studied for the last 60years (by smarter people than me)---surely nothing new can befound out about them. I even have the vague recollection thatI did not quite understand them during my study. If I remembercorrectly,\footnote{That was really a long time ago.} I gotutterly confused about $\epsilon$ and the emptystring.\footnote{Obviously the lecturer must have been bad.}Since my study, I have used regular expressions forimplementing lexers and parsers as I have always beeninterested in all kinds of programming languages andcompilers, which invariably need regular expression in someform or shape. To understand my fascination nowadays with regularexpressions, you need to know that my main scientific interestfor the last 14 years has been with theorem provers. I am acore developer of one ofthem.\footnote{\url{http://isabelle.in.tum.de}} Theoremprovers are systems in which you can formally reason aboutmathematical concepts, but also about programs. In this waythey can help with writing bug-free code. Theorem provers haveproved already their value in a number of systems (even interms of hard cash), but they are still clunky and difficultto use by average programmers.Anyway, in about 2011 I came across the notion of\defn{derivatives of regular expressions}. This notion allowsone to do almost all calculations in regular language theoryon the level of regular expressions, not needing any automata.This is important because automata are graphs and it is ratherdifficult to reason about graphs in theorem provers. Incontrast, to reason about regular expressions is easy-peasy intheorem provers. Is this important? I think yes, becauseaccording to Kuklewicz nearly all POSIX-based regularexpression matchers arebuggy.\footnote{\url{http://www.haskell.org/haskellwiki/Regex_Posix}}With my PhD student Fahad Ausaf I am currently working onproving the correctness for one such matcher that wasproposed by Sulzmann and Lu in2014.\footnote{\url{http://goo.gl/bz0eHp}} This would be anattractive results since we will be able to prove that thealgorithm is really correct, but also that the machine code(!)that implements this code efficiently is correct. Writingprograms in this way does not leave any room for potentialerrors or bugs. How nice!What also helped with my fascination with regular expressionsis that we could indeed find out new things about them thathave surprised some experts in the field of regularexpressions. Together with two colleagues from China, I wasable to prove the Myhill-Nerode theorem by only using regularexpressions and the notion of derivatives. Earlier versions ofthis theorem used always automata in the proof. Using thistheorem we can show that regular languages are closed undercomplementation, something which Gasarch in hisblog\footnote{\url{http://goo.gl/2R11Fw}} assumed can only beshown via automata. Even sombody who has written a 700+-pagebook\footnote{\url{http://goo.gl/fD0eHx}} on regularexprssions did not know better. Well, we showed it can also bedone with regular expressions only.\footnote{\url{http://www.inf.kcl.ac.uk/staff/urbanc/Publications/rexp.pdf}} What a feeling if you are an outsider to the subject!To conclude: Despite my early ignorance about regularexpressions, I find them now quite interesting. They have abeautiful mathematical theory behind them. They have practicalimportance (remember the shocking runtime of the regularexpression matchers in Python and Ruby in some instances).People who are not very familiar with the mathematicalbackground of regular expressions get them consistently wrong.The hope is that we can do better in the future---for exampleby proving that the algorithms actually satisfy theirspecification and that the corresponding implementations donot contain any bugs. We are close, but not yet quite there.Notwithstanding my fascination, I am also happy to admit that regularexpressions have their shortcomings. There are some well-known``theoretical'' shortcomings, for example recognising stringsof the form $a^{n}b^{n}$. I am not so bothered by them. What Iam bothered about is when regular expressions are in the wayof practical programming. For example, it turns out that theregular expression for email addresses shown in \eqref{email}is hopelessly inadequate for recognising all of them (despitebeing touted as something every computer scientist should knowabout). The W3 Consortium (which standardises the Web)proposes to use the following, already more complicatedregular expressions for email addresses:{\small\begin{lstlisting}[language={},keywordstyle=\color{black},numbers=none][a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*\end{lstlisting}}\noindent But they admit that by using this regular expressionthey wilfully violate the RFC 5322 standard which specifiesthe syntax of email addresses. With their proposed regularexpression they are too strict in some cases and too lax inothers. Not a good situation to be in. A regular expressionthat is claimed to be closer to the standard is shown inFigure~\ref{monster}. Whether this claim is true or not, Iwould not know---the only thing I can say to this regularexpression is it is a monstrosity. However, this mightactually be an argument against the RFC standard, rather thanagainst regular expressions. Still it is good to know thatsome tasks in text processing just cannot be achieved by usingregular expressions.\begin{figure}[p]\lstinputlisting{../progs/crawler1.scala}\caption{The Scala code for a simple web-crawler that checksfor broken links in a web-page. It uses the regular expression\texttt{http\_pattern} in Line~15 for recognising URL-addresses.It finds all links using the library function\texttt{findAllIn} in Line~21.\label{crawler1}}\end{figure}\begin{figure}[p]\lstinputlisting{../progs/crawler2.scala}\caption{A version of the web-crawler that only follows linksin ``my'' domain---since these are the ones I am interested into fix. It uses the regular expression \texttt{my\_urls} inLine~16 to check for my name in the links. The main change isin Lines~24--28 where there is a test whether URL is in ``my''domain or not.\label{crawler2}}\end{figure}\begin{figure}[p]\lstinputlisting{../progs/crawler3.scala}\caption{A small email harvester---whenever we download aweb-page, we also check whether it contains any emailaddresses. For this we use the regular expression\texttt{email\_pattern} in Line~15. The main change is in Line30 where all email addresses that can be found in a page areprinted.\label{crawler3}}\end{figure}\begin{figure}[p]\tiny\begin{center}\begin{minipage}{0.8\textwidth}\lstinputlisting[language={},keywordstyle=\color{black},numbers=none]{../progs/email-rexp}\end{minipage}\end{center}\caption{Nothing that can be said\ldots\label{monster}}\end{figure}\end{document}%%% Local Variables: %%% mode: latex%%% TeX-master: t%%% End: