\documentclass{article}\usepackage{../style}\usepackage{../langs}\usepackage{../graphics}\usepackage{../data}%%http://regexcrossword.com/challenges/cities/puzzles/1%%https://jex.im/regulex/%%https://www.reddit.com/r/ProgrammingLanguages/comments/42dlem/mona_compiler_development_part_2_parsing/%%https://www.reddit.com/r/ProgrammingLanguages/comments/43wlkq/formal_grammar_for_csh_tsch_sh_or_bash/%% regex displayers%% https://regexper.com/#a%7Ca%% https://www.debuggex.com%% https://jex.im/regulex/#!embed=false&flags=&re=%5E(a%7Cb)*%3F%24%% email validator%% http://www.ex-parrot.com/%7Epdw/Mail-RFC822-Address.html%% regex testers% https://regex101.com% http://regexr.com%% emacs regexes%% https://www.gnu.org/software/emacs/manual/html_node/elisp/Regular-Expressions.html%% reasons for a new prgramming language%% http://beautifulracket.com%regher% We can start off with a couple of observations about the role of% compilers. First, hardware is getting weirder rather than getting% clocked faster: almost all processors are multicores and it looks% like there is increasing asymmetry in resources across% cores. Processors come with vector units, crypto accelerators, bit% twiddling instructions, and lots of features to make virtualization% and concurrency work. We have DSPs, GPUs, big.little, and Xeon% Phi. This is only scratching the surface. Second, we’re getting% tired of low-level languages and their associated security% disasters, we want to write new code, to whatever extent possible,% in safer, higher-level languages. Compilers are caught right in the% middle of these opposing trends: one of their main jobs is to help% bridge the large and growing gap between increasingly high-level% languages and increasingly wacky platforms. It’s effectively a% perpetual employment act for solid compiler hackers.% compiler explorer% https://gcc.godbolt.org\begin{document}\fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016}\section*{Handout 1}This module is about text processing, be it for web-crawlers,compilers, dictionaries, DNA-data and so on. When looking fora particular string, like $abc$ in a large text we can use theKnuth-Morris-Pratt algorithm, which is currently the mostefficient general string search algorithm. But often we do\emph{not} just look for a particular string, but for stringpatterns. For example in program code we need to identify whatare the keywords, what are the identifiers etc. A pattern foridentifiers could be stated as: they start with a letter,followed by zero or more letters, numbers and underscores.Also often we face the problem that we are given a string (forexample some user input) and want to know whether it matches aparticular pattern---be it an email address, for example. Inthis way we can exclude user input that would otherwise havenasty effects on our program (crashing it or making it go intoan infinite loop, if not worse).\smallskip\defn{Regular expressions} help with conveniently specifyingsuch patterns. The idea behind regular expressions is thatthey are a simple method for describing languages (or sets ofstrings)\ldots at least languages we are interested in incomputer science. For example there is no convenient regularexpression for describing the English language short ofenumerating all English words. But they seem useful fordescribing for example simple email addresses.\footnote{See``8 Regular Expressions You Should Know''\url{http://goo.gl/5LoVX7}} Consider the following regularexpression\begin{equation}\label{email}\texttt{[a-z0-9\_.-]+} \;\;\texttt{@}\;\; \texttt{[a-z0-9.-]+} \;\;\texttt{.}\;\; \texttt{[a-z.]\{2,6\}}\end{equation}\noindent where the first part, the user name, matches one or more lowercaseletters (\pcode{a-z}), digits (\pcode{0-9}), underscores, dotsand hyphens. The \pcode{+} at the end of the brackets ensuresthe ``one or more''. Then comes the \pcode{@}-sign, followedby the domain name which must be one or more lowercaseletters, digits, underscores, dots or hyphens. Note therecannot be an underscore in the domain name. Finally there mustbe a dot followed by the toplevel domain. This toplevel domainmust be 2 to 6 lowercase letters including the dot. Examplestrings which follow this pattern are:\begin{lstlisting}[language={},numbers=none,keywordstyle=\color{black}]niceandsimple@example.orgvery.common@example.co.uka.little.lengthy.but.fine@dept.example.ac.ukother.email-with-dash@example.edu\end{lstlisting}\noindentBut for example the following two do not\begin{lstlisting}[language={},numbers=none,keywordstyle=\color{black}]user@localserverdisposable.style.email.with+symbol@example.com\end{lstlisting}\noindent according to the regular expression we specified in\eqref{email}. Whether this is intended or not is a differentquestion (the second email above is actually an acceptableemail address acording to the RFC 5322 standard for emailaddresses).As mentioned above, identifiers, or variables, in program codeare often required to satisfy the constraints that they startwith a letter and then can be followed by zero or more lettersor numbers and also can include underscores, but not as thefirst character. Such identifiers can be recognised with theregular expression\begin{center}\pcode{[a-zA-Z] [a-zA-Z0-9_]*}\end{center}\noindent Possible identifiers that match this regular expression are \pcode{x}, \pcode{foo}, \pcode{foo_bar_1}, \pcode{A_very_42_long_object_name},but not \pcode{_i} and also not \pcode{4you}.Many programming languages offer libraries that can be used tovalidate such strings against regular expressions. Also thereare some common, and I am sure very familiar, ways of how toconstruct regular expressions. For example in Scala we havea library implementing the following regular expressions: \begin{center}\begin{tabular}{lp{9cm}}\pcode{re*} & matches 0 or more occurrences of preceding expression\\\pcode{re+} & matches 1 or more occurrences of precedingexpression\\\pcode{re?} & matches 0 or 1 occurrence of preceding expression\\\pcode{re\{n\}} & matches exactly \pcode{n} number of occurrences of preceding expression\\\pcode{re\{n,m\}} & matches at least \pcode{n} and at most {\tt m}occurences of the preceding expression\\\pcode{[...]} & matches any single character inside the brackets\\\pcode{[^...]} & matches any single character not inside the brackets\\\pcode{...-...} & character ranges\\\pcode{\\d} & matches digits; equivalent to \pcode{[0-9]}\\\pcode{.} & matches every character except newline\\\pcode{(re)} & groups regular expressions and remembers matched text\end{tabular}\end{center}\noindent With this table you can figure out the purpose ofthe regular expressions in the web-crawlers shown Figures\ref{crawler1}, \ref{crawler2} and\ref{crawler3}. Note, however, the regular expression forhttp-addresses in web-pages in Figure~\ref{crawler1}, Line 15,is intended to be\[\pcode{"https?://[^"]*"}\]\noindent It specifies that web-addresses need to start with adouble quote, then comes \texttt{http} followed by an optional\texttt{s} and so on until the closing double quote comes.Usually we would have to escape the double quotes in order tomake sure we interpret the double quote as character, not asdouble quote for a string. But Scala's trick with triplequotes allows us to omit this kind of escaping. As a result wecan just write:\[\pcode{""""https?://[^"]*"""".r}\]\noindent Note also that the convention in Scala is that\texttt{.r} converts a string into a regular expression. Ileave it to you to ponder whether this regular expressionreally captures all possible web-addresses.\subsection*{Why Study Regular Expressions?}Regular expressions were introduced by Kleene in the 1950iesand they have been object of intense study since then. Theyare nowadays pretty much ubiquitous in computer science. Thereare many libraries implementing regular expressions. I am sureyou have come across them before (remember PRA?). Why on earththen is there any interest in studying them again in depth inthis module? Well, one answer is in the following two graphs aboutregular expression matching in Python, Ruby and Java.\begin{center}\begin{tabular}{@{\hspace{-1mm}}c@{\hspace{-1mm}}c@{}}\begin{tikzpicture}\begin{axis}[ title={Graph: $\texttt{a?\{n\}\,a{\{n\}}}$ and strings $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$}, xlabel={$n$}, x label style={at={(1.05,0.0)}}, ylabel={time in secs}, enlargelimits=false, xtick={0,5,...,30}, xmax=33, ymax=35, ytick={0,5,...,30}, scaled ticks=false, axis lines=left, width=5.5cm, height=4.5cm, legend entries={Python,Ruby}, legend pos=north west, legend cell align=left]\addplot[blue,mark=*, mark options={fill=white}] table {re-python.data};\addplot[brown,mark=triangle*, mark options={fill=white}] table {re-ruby.data}; \end{axis}\end{tikzpicture}&\begin{tikzpicture}\begin{axis}[ title={Graph: $\texttt{(a*)*\,b}$ and strings $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$}, xlabel={$n$}, x label style={at={(1.05,0.0)}}, ylabel={time in secs}, enlargelimits=false, xtick={0,5,...,30}, xmax=33, ymax=35, ytick={0,5,...,30}, scaled ticks=false, axis lines=left, width=5.5cm, height=4.5cm, legend entries={Java}, legend pos=north west, legend cell align=left]\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};\end{axis}\end{tikzpicture}\end{tabular}\end{center}\noindent This first graph shows that Python needs approximately 29seconds for finding out whether a string of 28 \texttt{a}smatches the regular expression \texttt{a?\{28\}\,a\{28\}}.Ruby is even slightly worse.\footnote{In this example Rubyuses the slightly different regular expression\texttt{a?a?a?...a?a?aaa...aa}, where the \texttt{a?} and\texttt{a} each occur $n$ times. More such test cases can befound at \url{http://www.computerbytesman.com/redos/}.} Simlarly, Java needs approximately30 seconds to find out that the regular expression $\texttt{(a*)*\,b}$does not match strings of 28 \texttt{a}s.Admittedly, these regular expressions are carefully chosen toexhibit this exponential behaviour, but similar ones occurmore often than one wants in ``real life''. For example, on 20July 2016 a similar regular expression brought the webpage\href{http://stackexchange.com}{Stack Exchange} to its knees:\begin{center}\url{http://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}\end{center}\noindent I can also highly recommend a cool webtalk from an engineerfrom Stack Exchange on the same subject:\begin{center}\url{https://vimeo.com/112065252}\end{center}\noindentA similar problem also occured in the Atom editor:\begin{center}\url{http://davidvgalbraith.com/how-i-fixed-atom/}\end{center}\noindentSuch troublesome regular expressions are sometimes called \emph{evil regular expressions} because they have the potential to make regularexpression matching engines to topple over, like in Python, Ruby andJava. This ``toppling over'' is also sometimes called \emph{catastrophic backtracking}. The problem with evil regular expressions is thatthey can have some serious consequences, for example, if you use themin your web-application. The reason is that hackers can look for theseinstances where the matching engine behaves badly and then mount anice DoS-attack against your application. These attacks are alreadyhave their own name: \emph{Regular Expression Denial of Service Attacks (ReDoS)}.It will be instructive to look behind the ``scenes'' to findout why Python and Ruby (and others) behave so badly whenmatching strings with evil regular expressions. But we will also lookat a relatively simple algorithm that solves this problem muchbetter than Python and Ruby do\ldots actually it will be twoversions of the algorithm: the first one will be able toprocess strings of approximately 1,000 \texttt{a}s in 30seconds, while the second version will even be able to processup to 12,000 in less than 10(!) seconds, see the graph below:\begin{center}\begin{tikzpicture} \begin{axis}[ title={Graph: $\texttt{a?\{n\}\,a{\{n\}}}$ and strings $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$}, xlabel={$n$}, x label style={at={(1.05,0.0)}}, ylabel={time in secs}, enlargelimits=false, xtick={0,3000,...,12000}, xmax=13000, ymax=12, ytick={0,5,10}, scaled ticks=false, axis lines=left, width=7cm, height=4.5cm, legend entries={Our Algorithm V1, Our Algorithm V2}, legend pos=outer north east]\addplot[green,mark=square*,mark options={fill=white}] table {re2.data};\addplot[black,mark=square*,mark options={fill=white}] table {re3.data};\end{axis}\end{tikzpicture}\end{center}\noindent And in the case of the regular expression $\texttt{(a*)*\,b}$and strings of \texttt{a}s we will beat Java by factor ofapproximately 1,000,000 (note the scale on the $x$-axis). \begin{center}\begin{tikzpicture} \begin{axis}[ title={Graph: $\texttt{(a*)*\,b}$ and strings $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$}, xlabel={$n$}, x label style={at={(1.05,0.0)}}, ylabel={time in secs}, enlargelimits=false, ymax=12, ytick={0,5,10}, axis lines=left, width=7cm, height=4.5cm, legend entries={Our Algorithm V2}, legend pos=outer north east]\addplot[black,mark=square*,mark options={fill=white}] table {re3a.data};\end{axis}\end{tikzpicture}\end{center}\subsection*{Basic Regular Expressions}The regular expressions shown earlier for Scala, wewill call \emph{extended regular expressions}. The ones wewill mainly study in this module are \emph{basic regularexpressions}, which by convention we will just call\emph{regular expressions}, if it is clear what we mean. Theattraction of (basic) regular expressions is that manyfeatures of the extended ones are just syntactic sugar.(Basic) regular expressions are defined by the followinggrammar:\begin{center}\begin{tabular}{r@{\hspace{1mm}}r@{\hspace{1mm}}l@{\hspace{13mm}}l} $r$ & $::=$ & $\ZERO$ & null language\\ & $\mid$ & $\ONE$ & empty string / \texttt{""} / []\\ & $\mid$ & $c$ & single character\\ & $\mid$ & $r_1 + r_2$ & alternative / choice\\ & $\mid$ & $r_1 \cdot r_2$ & sequence\\ & $\mid$ & $r^*$ & star (zero or more)\\ \end{tabular}\end{center}\noindent Because we overload our notation, there are somesubtleties you should be aware of. When regular expressionsare referred to, then $\ZERO$ (in bold font) does not stand forthe number zero: rather it is a particular pattern that doesnot match any string. Similarly, in the context of regularexpressions, $\ONE$ does not stand for the number one but fora regular expression that matches the empty string. The letter$c$ stands for any character from the alphabet at hand. Againin the context of regular expressions, it is a particularpattern that can match the specified character. You shouldalso be careful with our overloading of the star: assuming youhave read the handout about our basic mathematical notation,you will see that in the context of languages (sets ofstrings) the star stands for an operation on languages. Here$r^*$ stands for a regular expression, which is different fromthe operation on sets is defined as\[A\star\dn \bigcup_{0\le n} A^n\]We will use parentheses to disambiguate regular expressions.Parentheses are not really part of a regular expression, andindeed we do not need them in our code because there the treestructure of regular expressions is always clear. But forwriting them down in a more mathematical fashion, parentheseswill be helpful. For example we will write $(r_1 + r_2)^*$,which is different from, say $r_1 + (r_2)^*$. The former meansroughly zero or more times $r_1$ or $r_2$, while the lattermeans $r_1$ or zero or more times $r_2$. This will turn out tobe two different patterns, which match in general differentstrings. We should also write $(r_1 + r_2) + r_3$, which isdifferent from the regular expression $r_1 + (r_2 + r_3)$, butin case of $+$ and $\cdot$ we actually do not care about theorder and just write $r_1 + r_2 + r_3$, or $r_1 \cdot r_2\cdot r_3$, respectively. The reasons for this will becomeclear shortly. In the literature you will often find that the choice $r_1 +r_2$ is written as $r_1\mid{}r_2$ or $r_1\mid\mid{}r_2$. Also,often our $\ZERO$ and $\ONE$ are written $\varnothing$ and$\epsilon$, respectively. Following the convention in theliterature, we will often omit the $\cdot$ all together. Thisis to make some concrete regular expressions more readable.For example the regular expression for email addresses shownin \eqref{email} would look like\[\texttt{[...]+} \;\cdot\; \texttt{@} \;\cdot\; \texttt{[...]+} \;\cdot\; \texttt{.} \;\cdot\; \texttt{[...]\{2,6\}}\]\noindentwhich is much less readable than \eqref{email}. Similarly forthe regular expression that matches the string $hello$ we should write\[h \cdot e \cdot l \cdot l \cdot o\]\noindentbut often just write {\it hello}.If you prefer to think in terms of the implementationof regular expressions in Scala, the constructors andclasses relate as follows\footnote{More about Scala is in the handout about \emph{A Crash-Course on Scala}.}\begin{center}\begin{tabular}{rcl}$\ZERO$ & $\mapsto$ & \texttt{ZERO}\\$\ONE$ & $\mapsto$ & \texttt{ONE}\\$c$ & $\mapsto$ & \texttt{CHAR(c)}\\$r_1 + r_2$ & $\mapsto$ & \texttt{ALT(r1, r2)}\\$r_1 \cdot r_2$ & $\mapsto$ & \texttt{SEQ(r1, r2)}\\$r^*$ & $\mapsto$ & \texttt{STAR(r)}\end{tabular}\end{center}A source of confusion might arise from the fact that weuse the term \emph{basic regular expression} for the regularexpressions used in ``theory'' and defined above, and\emph{extended regular expression} for the ones used in``practice'', for example in Scala. If runtime is not anissue, then the latter can be seen as syntactic sugar ofthe former. For example we could replace\begin{center}\begin{tabular}{rcl}$r+$ & $\mapsto$ & $r\cdot r^*$\\$r?$ & $\mapsto$ & $\ONE + r$\\$\backslash d$ & $\mapsto$ & $0 + 1 + 2 + \ldots + 9$\\$[\text{\it a - z}]$ & $\mapsto$ & $a + b + \ldots + z$\\\end{tabular}\end{center}\subsection*{The Meaning of Regular Expressions}So far we have only considered informally what the\emph{meaning} of a regular expression is. This is not goodenough for specifications of what algorithms are supposed todo or which problems they are supposed to solve.To define the meaning of a regular expression we willassociate with every regular expression a language, or set ofstrings. This language contains all the strings the regularexpression is supposed to match. To understand what is goingon here it is crucial that you have read the handoutabout basic mathematical notations.The \defn{meaning of a regular expression} can be definedby a recursive function called $L$ (for language), whichis defined as follows\begin{center}\begin{tabular}{rcll}$L(\ZERO)$ & $\dn$ & $\{\}$\\$L(\ONE)$ & $\dn$ & $\{[]\}$\\$L(c)$ & $\dn$ & $\{"c"\}$ & or equivalently $\dn \{[c]\}$\\$L(r_1+ r_2)$ & $\dn$ & $L(r_1) \cup L(r_2)$\\$L(r_1 \cdot r_2)$ & $\dn$ & $L(r_1) \,@\, L(r_2)$\\$L(r^*)$ & $\dn$ & $(L(r))\star$\\\end{tabular}\end{center}\noindent As a result we can now precisely state what themeaning, for example, of the regular expression $h \cdote \cdot l \cdot l \cdot o$ is, namely\[L(h \cdot e \cdot l \cdot l \cdot o) = \{"hello"\}\]\noindent This is expected because this regular expression is only supposed to match the ``$hello$''-string. Similarly ifwe have the choice-regular-expression $a + b$, its meaning is\[L(a + b) = \{"a", "b"\}\]\noindent You can now also see why we do not make a differencebetween the different regular expressions $(r_1 + r_2) + r_3$and $r_1 + (r_2 + r_3)$\ldots they are not the same regularexpression, but they have the same meaning. For exampleyou can do the following calculation which shows theyhave the same meaning:\begin{eqnarray*}L((r_1 + r_2) + r_3) & = & L(r_1 + r_2) \cup L(r_3)\\ & = & L(r_1) \cup L(r_2) \cup L(r_3)\\ & = & L(r_1) \cup L(r_2 + r_3)\\ & = & L(r_1 + (r_2 + r_3))\end{eqnarray*}The point of the definition of $L$ is that we can use it toprecisely specify when a string $s$ is matched by a regularexpression $r$, namely if and only if $s \in L(r)$. In fact wewill write a program \pcode{match} that takes any string $s$and any regular expression $r$ as argument and returns\emph{yes}, if $s \in L(r)$ and \emph{no}, if $s \not\inL(r)$. We leave this for the next lecture.There is one more feature of regular expressions that is worthmentioning. Given some strings, there are in general manydifferent regular expressions that can recognise thesestrings. This is obvious with the regular expression $a + b$which can match the strings $a$ and $b$. But also the regularexpression $b + a$ would match the same strings. However,sometimes it is not so obvious whether two regular expressionsmatch the same strings: for example do $r^*$ and $\ONE + r\cdot r^*$ match the same strings? What about $\ZERO^*$and $\ONE^*$? This suggests the following relation between\defn{equivalent regular expressions}: \[r_1 \equiv r_2 \;\dn\; L(r_1) = L(r_2)\]\noindent That means two regular expressions are said to beequivalent if they match the same set of strings. Therefore wedo not really distinguish between the different regularexpression $(r_1 + r_2) + r_3$ and $r_1 + (r_2 + r_3)$,because they are equivalent. I leave you to the questionwhether\[\ZERO^* \equiv \ONE^*\]\noindent holds or not? Such equivalences will be importantfor our matching algorithm, because we can use them tosimplify regular expressions, which will mean we can speedup the calculations. \subsection*{My Fascination for Regular Expressions}Up until a few years ago I was not really interested inregular expressions. They have been studied for the last 60years (by smarter people than me)---surely nothing new can befound out about them. I even have the vague recollection thatI did not quite understand them during my undergraduate study. If I remembercorrectly,\footnote{That was really a long time ago.} I gotutterly confused about $\ONE$ (which my lecturer wrote as$\epsilon$) and the empty string.\footnote{Obviously thelecturer must have been bad.} Since my then, I have usedregular expressions for implementing lexers and parsers as Ihave always been interested in all kinds of programminglanguages and compilers, which invariably need regularexpressions in some form or shape. To understand my fascination \emph{nowadays} with regularexpressions, you need to know that my main scientific interestfor the last 14 years has been with theorem provers. I am acore developer of one ofthem.\footnote{\url{http://isabelle.in.tum.de}} Theoremprovers are systems in which you can formally reason aboutmathematical concepts, but also about programs. In this waytheorem prover can help with the manacing problem of writing bug-free code. Theorem provers haveproved already their value in a number of cases (even interms of hard cash), but they are still clunky and difficultto use by average programmers.Anyway, in about 2011 I came across the notion of\defn{derivatives of regular expressions}. This notion allowsone to do almost all calculations in regular language theoryon the level of regular expressions, not needing any automata (you willsee we only touch briefly on automata in lecture 3).This is crucial because automata are graphs and it is ratherdifficult to reason about graphs in theorem provers. Incontrast, reasoning about regular expressions is easy-peasy intheorem provers. Is this important? I think yes, becauseaccording to Kuklewicz nearly all POSIX-based regularexpression matchers arebuggy.\footnote{\url{http://www.haskell.org/haskellwiki/Regex_Posix}}With my PhD student Fahad Ausaf I provedthe correctness for one such matcher that wasproposed by Sulzmann and Lu in2014.\footnote{\url{http://goo.gl/bz0eHp}} Hopefully we canprove that the machine code(!)that implements this code efficiently is correct also. Writingprograms in this way does not leave any room for potentialerrors or bugs. How nice!What also helped with my fascination with regular expressionsis that we could indeed find out new things about them thathave surprised some experts. Together with two colleagues from China, I wasable to prove the Myhill-Nerode theorem by only using regularexpressions and the notion of derivatives. Earlier versions ofthis theorem used always automata in the proof. Using thistheorem we can show that regular languages are closed undercomplementation, something which Gasarch in hisblog\footnote{\url{http://goo.gl/2R11Fw}} assumed can only beshown via automata. Even sombody who has written a 700+-pagebook\footnote{\url{http://goo.gl/fD0eHx}} on regularexprssions did not know better. Well, we showed it can also bedone with regular expressions only.\footnote{\url{http://www.inf.kcl.ac.uk/staff/urbanc/Publications/rexp.pdf}} What a feeling if you are an outsider to the subject!To conclude: Despite my early ignorance about regularexpressions, I find them now very interesting. They have abeautiful mathematical theory behind them, which can besometimes quite deep and which sometimes contains hidden snares. They havepractical importance (remember the shocking runtime of theregular expression matchers in Python, Ruby and Java in someinstances and the problems in Stack Exchange and the Atom editor). People who are not very familiar with themathematical background of regular expressions get themconsistently wrong. The hope is that we can do better in thefuture---for example by proving that the algorithms actuallysatisfy their specification and that the correspondingimplementations do not contain any bugs. We are close, but notyet quite there.Notwithstanding my fascination, I am also happy to admit that regularexpressions have their shortcomings. There are some well-known``theoretical'' shortcomings, for example recognising stringsof the form $a^{n}b^{n}$. I am not so bothered by them. What Iam bothered about is when regular expressions are in the wayof practical programming. For example, it turns out that theregular expression for email addresses shown in \eqref{email}is hopelessly inadequate for recognising all of them (despitebeing touted as something every computer scientist should knowabout). The W3 Consortium (which standardises the Web)proposes to use the following, already more complicatedregular expressions for email addresses:{\small\begin{lstlisting}[language={},keywordstyle=\color{black},numbers=none][a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*\end{lstlisting}}\noindent But they admit that by using this regular expressionthey wilfully violate the RFC 5322 standard which specifiesthe syntax of email addresses. With their proposed regularexpression they are too strict in some cases and too lax inothers. Not a good situation to be in. A regular expressionthat is claimed to be closer to the standard is shown inFigure~\ref{monster}. Whether this claim is true or not, Iwould not know---the only thing I can say about this regularexpression is it is a monstrosity. However, this mightactually be an argument against the RFC standard, rather thanagainst regular expressions. A similar argument is made in\begin{center}\url{https://elliot.land/validating-an-email-address}\end{center}\noindent which explains some of the crazier parts of emailaddresses. Still it is good to know that some tasks in textprocessing just cannot be achieved by using regularexpressions.\begin{figure}[p]\lstinputlisting{../progs/crawler1.scala}\caption{The Scala code for a simple web-crawler that checksfor broken links in a web-page. It uses the regular expression\texttt{http\_pattern} in Line~\ref{httpline} for recognisingURL-addresses. It finds all links using the library function\texttt{findAllIn} in Line~\ref{findallline}.\label{crawler1}}\end{figure}\begin{figure}[p]\lstinputlisting{../progs/crawler2.scala}\caption{A version of the web-crawler that only follows linksin ``my'' domain---since these are the ones I am interested into fix. It uses the regular expression \texttt{my\_urls} inLine~\ref{myurlline} to check for my name in the links. Themain change is inLines~\ref{changestartline}--\ref{changeendline} where thereis a test whether URL is in ``my'' domain ornot.\label{crawler2}}\end{figure}\begin{figure}[p]\lstinputlisting{../progs/crawler3.scala}\caption{A small email harvester---whenever we download aweb-page, we also check whether it contains any emailaddresses. For this we use the regular expression\texttt{email\_pattern} in Line~\ref{emailline}. The mainchange is in Line~\ref{mainline} where all email addressesthat can be found in a page are printed.\label{crawler3}}\end{figure}\begin{figure}[p]\tiny\begin{center}\begin{minipage}{0.8\textwidth}\lstinputlisting[language={},keywordstyle=\color{black},numbers=none]{../progs/email-rexp}\end{minipage}\end{center}\caption{Nothing that can be said about this regularexpression\ldots{}except it is a monstrosity.\label{monster}}\end{figure}\end{document}%%% Local Variables: %%% mode: latex%%% TeX-master: t%%% End: