% !TEX program = xelatex
\documentclass{article}
\usepackage{../style}
\usepackage{../langs}
\usepackage{../graphicss}
\usepackage{../data}
\usepackage{lstlinebgrd}
\definecolor{capri}{rgb}{0.0, 0.75, 1.0}

%%http://regexcrossword.com/challenges/cities/puzzles/1
%%https://jex.im/regulex/
%%https://www.reddit.com/r/ProgrammingLanguages/comments/42dlem/mona_compiler_development_part_2_parsing/
%%https://www.reddit.com/r/ProgrammingLanguages/comments/43wlkq/formal_grammar_for_csh_tsch_sh_or_bash/

%% regex displayers
%% https://regexper.com/#a%7Ca
%% https://www.debuggex.com
%% https://jex.im/regulex/#!embed=false&flags=&re=%5E(a%7Cb)*%3F%24

%% email validator
%% http://www.ex-parrot.com/%7Epdw/Mail-RFC822-Address.html
% https://jackfoxy.github.io/FsRegEx/emailregex.html

%% regex testers
% https://regex101.com
% http://regexr.com

%% emacs regexes
%% https://www.gnu.org/software/emacs/manual/html_node/elisp/Regular-Expressions.html

%% reasons for a new programming language
%% http://beautifulracket.com

% compiler explorer
% https://gcc.godbolt.org

% good article how languages have been influenced
% 10 MOST(LY DEAD) INFLUENTIAL PROGRAMMING LANGUAGES
% https://www.hillelwayne.com/post/influential-dead-languages/

\begin{document}
\fnote{\copyright{} Christian Urban, King's College London,
  2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021}

\section*{Handout 1}

The purpose of a compiler is to transform a program a human can read
and write into code machines can run as fast as possible. Developing a
compiler is an old craft going back to 1952 with the first compiler
written by Grace Hopper.\footnote{Who many years ago was invited on a
talk show hosted by David Letterman.
\here{https://youtu.be/3N_ywhx6_K0?t=31}} Why study compilers
nowadays? An interesting answer is given by John Regehr in his compiler
blog:\here{http://blog.regehr.org/archives/1419}

\begin{quote}\it{}``We can start off with a couple of observations
about the role of compilers.
First, hardware is getting weirder rather than getting clocked faster:
almost all processors are multicores and it looks like there is
increasing asymmetry in resources across cores. Processors come with
vector units, crypto accelerators, bit twiddling instructions, and lots
of features to make virtualization and concurrency work. We have DSPs,
GPUs, big.little, and Xeon Phi. This is only scratching the surface.
Second, we’re getting tired of low-level languages and their associated
security disasters, we want to write new code, to whatever extent
possible, in safer, higher-level languages. Compilers are caught right
in the middle of these opposing trends: one of their main jobs is to
help bridge the large and growing gap between increasingly high-level
languages and increasingly wacky platforms. It’s effectively a
perpetual employment act for solid compiler hackers.''
\end{quote}

\noindent
Given this, the goal of this module is to become a solid (beginner)
compiler hacker and as part of the coursework to implement two small
compilers for two very small programming languages.

The first part of the module is about the problem of text processing,
which is needed for compilers, but also for dictionaries, DNA-data,
spam-filters and so on. When looking for a particular string, say
\pcode{"foobar"}, in a large text we can use the Knuth-Morris-Pratt
algorithm, which is among the most efficient general string search
algorithms. But often we do \emph{not} just look for a particular
string, but for string patterns. For example, in program code we need
to identify which words are keywords (\texttt{if}, \texttt{then},
\texttt{while}, \texttt{for}, etc) and which are identifiers (variable
names). A pattern for identifiers could be stated as: they start with a
letter, followed by zero or more letters, numbers and underscores.
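To see the keyword/identifier distinction in action, here is a small
illustrative sketch in Python; the keyword set and the identifier
pattern below are just examples for this handout, not definitions used
later in the module.

```python
import re

KEYWORDS = {"if", "then", "while", "for"}     # example keyword set
# identifier pattern as stated above: a letter, then zero or more
# letters, digits or underscores
IDENT = re.compile(r"[a-zA-Z][a-zA-Z0-9_]*")

def classify(word):
    """Classify a word as a keyword, an identifier, or neither."""
    if word in KEYWORDS:
        return "keyword"
    if IDENT.fullmatch(word):
        return "identifier"
    return "unknown"

print(classify("while"))   # keyword
print(classify("x_1"))     # identifier
print(classify("4you"))    # unknown: identifiers must start with a letter
```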
%You might also be surprised what
%constraints programming languages impose about numbers: for example
%123 in JSON is OK, but 0123 is not.
%
%% The regex for JSON numbers is
%% -?(0|[1-9][0-9]*)(\.[0-9]+)?([eE][+-]?[0-9]+)?

Often we also face the problem that we are given a string, for example
some user input, and we want to know whether it matches a particular
pattern---is it an email address, for example. In this way we can
exclude user input that would otherwise have nasty effects on our
program (crashing it or making it go into an infinite loop, if not
worse). This kind of ``vetting'' mechanism is also implemented in
popular network security tools such as Snort and
Zeek.\here{www.snort.org}\here{www.bro.org} They scan incoming
network traffic for computer viruses or malicious packets. Similarly,
filtering out spam usually involves looking for some signature
(essentially a string pattern). The point is that the fast
Knuth-Morris-Pratt algorithm for strings is not good enough for such
string \emph{patterns}.\smallskip

\defn{Regular expressions} help with conveniently specifying
such patterns. The idea behind regular expressions is that
they are a simple method for describing languages (or sets of
strings)\ldots at least the languages we are interested in in
computer science. For example there is no convenient regular
expression for describing the English language short of
enumerating all English words. But they seem useful for
describing, for example, simple email addresses.\footnote{See
``8 Regular Expressions You Should Know''
\url{http://goo.gl/5LoVX7}} Consider the following regular
expression

\begin{equation}\label{email}
\texttt{[a-z0-9\_.-]+} \;\;\texttt{@}\;\; \texttt{[a-z0-9.-]+}
\;\;\texttt{.}\;\; \texttt{[a-z.]\{2,6\}}
\end{equation}

\noindent where the first part, the user name, matches one or more
lowercase letters (\pcode{a-z}), digits (\pcode{0-9}), underscores,
dots and hyphens. The \pcode{+} at the end of the brackets ensures the
``one or more''.
Then comes the email \pcode{@}-sign, followed by the domain name,
which must be one or more lowercase letters, digits, dots or hyphens
(but no underscores). Finally there must be a dot followed by the
toplevel domain. This toplevel domain must consist of 2 to 6
characters, each a lowercase letter or a dot. Example strings which
follow this pattern are:

\begin{lstlisting}[language={},numbers=none,keywordstyle=\color{black}]
niceandsimple@example.org
very.common@example.co.uk
a.little.lengthy.but.fine@dept.example.ac.uk
other.email-with-dash@example.edu
\end{lstlisting}

\noindent
But for example the following two do not

\begin{lstlisting}[language={},numbers=none,keywordstyle=\color{black}]
user@localserver
disposable.style.email.with+symbol@example.com
\end{lstlisting}

\noindent according to the regular expression we specified in
\eqref{email} above. Whether this is intended or not is a different
question (the second email above is actually an acceptable email
address according to the RFC 5322 standard for email addresses).

As mentioned above, identifiers, or variables, in program code are
often required to satisfy the constraint that they start with a
letter, which can then be followed by zero or more letters, numbers or
underscores (so an underscore cannot be the first character). Such
identifiers can be recognised with the regular expression

\begin{center}
\pcode{[a-zA-Z] [a-zA-Z0-9_]*}
\end{center}

\noindent Possible identifiers that match this regular expression
are \pcode{x}, \pcode{foo}, \pcode{foo_bar_1},
\pcode{A_very_42_long_object_name}, but not \pcode{_i} and also not
\pcode{4you}.

Many programming languages offer libraries that can be used to
validate such strings against regular expressions. Also there are some
common, and I am sure very familiar, ways to construct regular
expressions.
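As a quick illustration of such library support, the email pattern
from \eqref{email} can be tried out in Python's \texttt{re} library.
This is only a sketch: the dot before the toplevel domain is escaped
here so that it matches a literal dot rather than any character.

```python
import re

# A Python rendering of the email pattern from (1): user name,
# @-sign, domain name, a literal dot, and the toplevel domain.
email = re.compile(r"[a-z0-9_.-]+@[a-z0-9.-]+\.[a-z.]{2,6}")

for s in ["niceandsimple@example.org",
          "very.common@example.co.uk",
          "user@localserver",
          "disposable.style.email.with+symbol@example.com"]:
    # fullmatch checks the whole string, not just a substring
    print(s, "->", bool(email.fullmatch(s)))
```

The first two strings match; the last two do not, since
\texttt{localserver} lacks a dot-separated toplevel domain and the
\texttt{+} is not allowed by the user-name part of the pattern.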
For example in Scala we have a library implementing the following
regular expressions:

\begin{center}
\begin{tabular}{lp{9cm}}
\pcode{re*} & matches 0 or more occurrences of preceding
expression\\
\pcode{re+} & matches 1 or more occurrences of preceding
expression\\
\pcode{re?} & matches 0 or 1 occurrence of preceding
expression\\
\pcode{re\{n\}} & matches exactly \pcode{n} occurrences of
the preceding expression\\
\pcode{re\{n,m\}} & matches at least \pcode{n} and at most \pcode{m}
occurrences of the preceding expression\\
\pcode{[...]} & matches any single character inside the brackets\\
\pcode{[^...]} & matches any single character not inside the
brackets\\
\pcode{...-...} & character ranges\\
\pcode{\\d} & matches digits; equivalent to \pcode{[0-9]}\\
\pcode{.} & matches every character except newline\\
\pcode{(re)} & groups regular expressions and remembers
matched text
\end{tabular}
\end{center}

\noindent With this table you can figure out the purpose of the
regular expressions in the web-crawlers shown in
Figures \ref{crawler1} and \ref{crawler3}. In Figure~\ref{crawler1},
however, be careful with the regular expression for http-addresses in
Line~\ref{httpline}. It is intended to be

\[
\pcode{"https?://[^"]*"}
\]

\noindent specifying that http-addresses need to start with a double
quote, then comes \texttt{http} followed by an optional \texttt{s} and
so on\ldots{}until the closing double quote comes at the end of the
address. Normally we would have to escape the double quotes in order
to make sure we interpret the double quote as a character, not as the
delimiter of a string. But Scala's trick with triple quotes allows us
to omit this kind of ugly escaping. As a result we can just write:

\[
\pcode{""""https?://[^"]*"""".r}
\]

\noindent The convention in Scala is that \texttt{.r} converts a
string into a regular expression. I leave it to you to ponder whether
this regular expression really captures all possible web-addresses.
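The same idea can be sketched in Python, where a raw string with
single-quote delimiters plays a similar role to Scala's triple quotes
in avoiding the escaping of the double quotes (the HTML snippet below
is just an illustrative example):

```python
import re

# the crawler's http pattern: a double quote, "http" with an optional
# "s", "://", any characters except a double quote, a closing quote
http_pattern = re.compile(r'"https?://[^"]*"')

html = '<a href="https://example.org/index.html">link</a>'
# findall returns every quoted http-address in the text
print(http_pattern.findall(html))   # ['"https://example.org/index.html"']
```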
If you need a quick recap about regular expressions and how they match
strings, here is a quick video: \url{https://youtu.be/bgBWp9EIlMM}.

\subsection*{Why Study Regular Expressions?}

Regular expressions were introduced by Kleene in the 1950s and they
have been the object of intense study since then. They are nowadays
pretty much ubiquitous in computer science. There are many libraries
implementing regular expressions. I am sure you have come across them
before (remember the PRA or PEP modules?).

Why on earth then is there any interest in studying them again in
depth in this module? Well, one answer is in the following two graphs
about regular expression matching in Python, Ruby, JavaScript and Java
(Version 8).

\begin{center}
\begin{tabular}{@{\hspace{-1mm}}c@{\hspace{1.5mm}}c@{}}
\begin{tikzpicture}
\begin{axis}[
    title={Graph: $\texttt{(a*)*\,b}$ and strings
           $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
    xlabel={$n$},
    x label style={at={(1.05,0.0)}},
    ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5.5cm,
    height=4.5cm,
    legend entries={Python, Java 8, JavaScript},
    legend pos=north west,
    legend cell align=left]
\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
\addplot[red,mark=*, mark options={fill=white}] table {re-js.data};
\end{axis}
\end{tikzpicture}
&
\begin{tikzpicture}
\begin{axis}[
    title={Graph: $\texttt{a?\{n\}\,a{\{n\}}}$ and strings
           $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
    xlabel={$n$},
    x label style={at={(1.05,0.0)}},
    ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5.5cm,
    height=4.5cm,
    legend entries={Python,Ruby},
    legend pos=north west,
    legend cell align=left]
\addplot[blue,mark=*, mark options={fill=white}] table {re-python.data};
\addplot[brown,mark=triangle*, mark
options={fill=white}] table {re-ruby.data};
\end{axis}
\end{tikzpicture}
\end{tabular}
\end{center}

\noindent The first graph shows that Python, JavaScript and Java 8
need approximately 30 seconds to find out that the regular expression
$\texttt{(a*)*\,b}$ does not match strings of 28 \texttt{a}s.
Similarly, the second graph shows that Python and Ruby need
approximately 29 seconds to find out whether a string of 28
\texttt{a}s matches the regular expression
\texttt{a?\{28\}\,a\{28\}}.\footnote{In this example Ruby actually
uses the slightly different regular expression
\texttt{a?a?a?...a?a?aaa...aa}, where the \texttt{a?} and \texttt{a}
each occur $n$ times. More such test cases can be found at
\url{https://www.owasp.org/index.php/Regular_expression_Denial_of_Service_-_ReDoS}.}
Admittedly, these regular expressions are carefully chosen to exhibit
this exponential behaviour, but similar ones occur more often than one
wants in ``real life''. For example, on 20 July 2016 a similar regular
expression brought the webpage
\href{http://stackexchange.com}{Stack Exchange} to its knees:

\begin{center}
\url{http://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}
\end{center}

\noindent I can also highly recommend a cool webtalk by an engineer
from Stack Exchange on the same subject:

\begin{center}
\url{https://vimeo.com/112065252}
\end{center}

\noindent
A similar problem also occurred in the Atom editor:

\begin{center}
\url{http://davidvgalbraith.com/how-i-fixed-atom/}
\end{center}

\noindent
and also when somebody tried to match web-addresses using a relatively
simple regular expression:

\begin{center}
\url{https://www.tutorialdocs.com/article/regex-trap.html}
\end{center}

\noindent
Finally, on 2 July 2019 Cloudflare had a global outage because of a
regular expression (they had no outage for the 6 years before).
What happened is nicely explained in the blog:

\begin{center}
\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019}
\end{center}

Such troublesome regular expressions are sometimes called \emph{evil
regular expressions} because they have the potential to make regular
expression matching engines topple over, like in Python, Ruby,
JavaScript and Java 8. This ``toppling over'' is also sometimes called
\emph{catastrophic backtracking}. I have also seen the term
\emph{eternal matching} used for this. The problem with evil regular
expressions and catastrophic backtracking is that they can have some
serious consequences, for example, if you use them in your
web-application. The reason is that hackers can look for these
instances where the matching engine behaves badly and then mount a
nice DoS-attack against your application. These attacks already have
their own name: \emph{Regular Expression Denial of Service Attacks
(ReDoS)}.

It will be instructive to look behind the ``scenes'' to find out why
Python and Ruby (and others) behave so badly when matching strings
with evil regular expressions. But we will also look at a relatively
simple algorithm that solves this problem much better than Python and
Ruby do\ldots actually it will be two versions of the algorithm: the
first one will be able, in the example \texttt{a?\{n\}\,a\{n\}}, to
process strings of approximately 1,100 \texttt{a}s in 23 seconds,
while the second version will even be able to process up to 11,000(!)
in 5 seconds, see the graph below:

\begin{center}
\begin{tikzpicture}
\begin{axis}[
    title={Graph: $\texttt{a?\{n\}\,a{\{n\}}}$ and strings
           $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
    xlabel={$n$},
    x label style={at={(1.05,0.0)}},
    ylabel={time in secs},
    enlargelimits=false,
    xtick={0,3000,...,12000},
    xmax=13000,
    ymax=32,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=7cm,
    height=4.4cm,
    legend entries={Our Algorithm V1, Our Algorithm V2},
    legend pos=outer north east]
\addplot[green,mark=square*,mark options={fill=white}] table {re2.data};
\addplot[black,mark=square*,mark options={fill=white}] table {re3.data};
\end{axis}
\end{tikzpicture}
\end{center}

\noindent And in the case of the regular expression $\texttt{(a*)*\,b}$
and strings of \texttt{a}s we will beat Java 8 by a factor of
approximately 1,000,000 (note the scale on the $x$-axis).

\begin{center}
\begin{tikzpicture}
\begin{axis}[
    title={Graph: $\texttt{(a*)*\,b}$ and strings
           $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
    xlabel={$n$},
    x label style={at={(1.05,0.0)}},
    ylabel={time in secs},
    enlargelimits=false,
    ymax=32,
    ytick={0,5,...,30},
    axis lines=left,
    width=7cm,
    height=4.5cm,
    legend entries={Our Algorithm V2},
    legend pos=outer north east]
\addplot[black,mark=square*,mark options={fill=white}] table {re3a.data};
\end{axis}
\end{tikzpicture}
\end{center}

\noindent
You might have wondered above why I looked at the (now) old Java 8:
the reason is that Java 9 and later versions are a bit better, but we
will still beat them hands down with our regex matcher.

\subsection*{Basic Regular Expressions}

The regular expressions shown earlier for Scala we
will call \emph{extended regular expressions}. The ones we
will mainly study in this module are \emph{basic regular
expressions}, which by convention we will just call
\emph{regular expressions}, if it is clear what we mean.
The attraction of (basic) regular expressions is that many
features of the extended ones are just syntactic sugar.
(Basic) regular expressions are defined by the following
grammar:

\begin{center}
\begin{tabular}{r@{\hspace{1mm}}r@{\hspace{1mm}}l@{\hspace{13mm}}l}
  $r$ & $::=$  & $\ZERO$          & null language\\
      & $\mid$ & $\ONE$           & empty string / \texttt{""} / []\\
      & $\mid$ & $c$              & single character\\
      & $\mid$ & $r_1 + r_2$      & alternative / choice\\
      & $\mid$ & $r_1 \cdot r_2$  & sequence\\
      & $\mid$ & $r^*$            & star (zero or more)\\
\end{tabular}
\end{center}

\noindent Because we overload our notation, there are some
subtleties you should be aware of. When regular expressions
are referred to, then $\ZERO$ (in bold font) does not stand for
the number zero: rather it is a particular pattern that does
not match any string. Similarly, in the context of regular
expressions, $\ONE$ does not stand for the number one but for
a regular expression that matches the empty string. The letter
$c$ stands for any character from the alphabet at hand. Again,
in the context of regular expressions, it is a particular
pattern that can match the specified character. You should
also be careful with our overloading of the star: assuming you
have read the handout about our basic mathematical notation,
you will see that in the context of languages (sets of
strings) the star stands for an operation on languages. Here
$r^*$ stands for a regular expression, which is different from
the operation on sets, which is defined as

\[
A\star \dn \bigcup_{0\le n} A^n
\]

We will use parentheses to disambiguate regular expressions.
Parentheses are not really part of a regular expression, and
indeed we do not need them in our code because there the tree
structure of regular expressions is always clear. But for
writing them down in a more mathematical fashion, parentheses
will be helpful. For example we will write $(r_1 + r_2)^*$,
which is different from, say, $r_1 + (r_2)^*$.
This can be seen if we write regular expressions as trees:

\begin{center}
\includegraphics[scale=0.65]{../pics/r1.pdf}
\hspace{3cm}
\includegraphics[scale=0.65]{../pics/r2.pdf}
\end{center}

\noindent
The regular expression on the left means roughly zero or more times
$r_1$ or $r_2$, while the one on the right means $r_1$, or zero or
more times $r_2$. These will turn out to be two different patterns,
which in general match different strings. We should also write
$(r_1 + r_2) + r_3$, which is different from the regular expression
$r_1 + (r_2 + r_3)$, but in the case of $+$ and $\cdot$ we actually do
not care about the order and just write $r_1 + r_2 + r_3$, or
$r_1 \cdot r_2 \cdot r_3$, respectively. The reasons for this
carelessness will become clear shortly.

In the literature you will often find that the choice $r_1 + r_2$ is
written as $r_1\mid{}r_2$ or $r_1\mid\mid{}r_2$. Also, often our
$\ZERO$ and $\ONE$ are written $\varnothing$ and $\epsilon$,
respectively. Following the convention in the literature, we will
often omit the $\cdot$. This is to make some concrete regular
expressions more readable. For example the regular expression for
email addresses shown in \eqref{email} would, fully expanded, look
like

\[
\texttt{[...]+} \;\cdot\; \texttt{@} \;\cdot\;
\texttt{[...]+} \;\cdot\; \texttt{.} \;\cdot\;
\texttt{[...]\{2,6\}}
\]

\noindent
which is much less readable than the regular expression in
\eqref{email}.
Similarly for the regular expression that matches the string $hello$
we should write

\[
h \cdot e \cdot l \cdot l \cdot o
\]

\noindent
but often just write {\it hello}.

If you prefer to think in terms of the implementation
of regular expressions in Scala, the constructors and
classes relate as follows\footnote{More about Scala is in the handout
about \emph{A Crash-Course on Scala} from PEP.}

\begin{center}
\begin{tabular}{rcl}
$\ZERO$         & $\mapsto$ & \texttt{ZERO}\\
$\ONE$          & $\mapsto$ & \texttt{ONE}\\
$c$             & $\mapsto$ & \texttt{CHAR(c)}\\
$r_1 + r_2$     & $\mapsto$ & \texttt{ALT(r1, r2)}\\
$r_1 \cdot r_2$ & $\mapsto$ & \texttt{SEQ(r1, r2)}\\
$r^*$           & $\mapsto$ & \texttt{STAR(r)}
\end{tabular}
\end{center}

A source of confusion might arise from the fact that we
use the term \emph{basic regular expression} for the regular
expressions used in ``theory'' and defined above, and
\emph{extended regular expression} for the ones used in
``practice'', for example in Scala. If runtime is not an
issue, then the latter can be seen as syntactic sugar for
the former. For example we could replace

\begin{center}
\begin{tabular}{rcl}
$r^+$                & $\mapsto$ & $r\cdot r^*$\\
$r^?$                & $\mapsto$ & $\ONE + r$\\
$\backslash d$       & $\mapsto$ & $0 + 1 + 2 + \ldots + 9$\\
$[\text{\it a - z}]$ & $\mapsto$ & $a + b + \ldots + z$\\
\end{tabular}
\end{center}

\subsection*{The Meaning of Regular Expressions}

So far we have only considered informally what the
\emph{meaning} of a regular expression is. This is not good
enough for specifications of what algorithms are supposed to
do or which problems they are supposed to solve.

To define the meaning of a regular expression we will
associate with every regular expression a language---a set of
strings. This language contains all the strings the regular
expression is supposed to match.
To understand what is going on here it is crucial that you have read
the handout about basic mathematical notations.

The \defn{meaning of a regular expression} can be defined
by a recursive function called $L$ (for language), which
is defined as follows

\begin{center}
\begin{tabular}{rcll}
$L(\ZERO)$          & $\dn$ & $\{\}$\\
$L(\ONE)$           & $\dn$ & $\{[]\}$\\
$L(c)$              & $\dn$ & $\{"c"\}$ & or equivalently $\dn \{[c]\}$\\
$L(r_1 + r_2)$      & $\dn$ & $L(r_1) \cup L(r_2)$\\
$L(r_1 \cdot r_2)$  & $\dn$ & $L(r_1) \,@\, L(r_2)$\\
$L(r^*)$            & $\dn$ & $(L(r))\star$\\
\end{tabular}
\end{center}

\noindent As a result we can now precisely state what the
meaning, for example, of the regular expression
$h \cdot e \cdot l \cdot l \cdot o$ is, namely

\[
L(h \cdot e \cdot l \cdot l \cdot o) = \{"hello"\}
\]

\noindent This is expected because this regular expression
is only supposed to match the ``$hello$''-string. Similarly if
we have the choice-regular-expression $a + b$, its meaning is

\[
L(a + b) = \{"a", "b"\}
\]

\noindent You can now also see why we do not make a difference
between the different regular expressions $(r_1 + r_2) + r_3$
and $r_1 + (r_2 + r_3)$\ldots they are not the same regular
expression, but they have the same meaning. For example
you can do the following calculation which shows they
have the same meaning:

\begin{eqnarray*}
L((r_1 + r_2) + r_3) & = & L(r_1 + r_2) \cup L(r_3)\\
                     & = & L(r_1) \cup L(r_2) \cup L(r_3)\\
                     & = & L(r_1) \cup L(r_2 + r_3)\\
                     & = & L(r_1 + (r_2 + r_3))
\end{eqnarray*}

\noindent
That means both languages are the same.

The point of the definition of $L$ is that we can use it to
precisely specify when a string $s$ is matched by a regular
expression $r$, namely if and only if $s \in L(r)$. In fact we
will write a program \pcode{match} that takes a string $s$
and a regular expression $r$ as arguments and returns
\emph{yes}, if $s \in L(r)$ and \emph{no}, if $s \not\in
L(r)$. We leave this for the next lecture.

There is one more feature of regular expressions that is worth
mentioning here.
Given some strings, there are in general many different regular
expressions that can recognise these strings. This is obvious with the
regular expression $a + b$ which can match the strings $a$ and $b$.
But also the regular expression $b + a$ would match the same strings.
However, sometimes it is not so obvious whether two regular
expressions match the same strings: for example do $r^*$ and
$\ONE + r \cdot r^*$ match the same strings? What about $\ZERO^*$
and $\ONE^*$? This suggests the following relation between
\defn{equivalent regular expressions}:

\[
r_1 \equiv r_2 \;\dn\; L(r_1) = L(r_2)
\]

\noindent That means two regular expressions are said to be
equivalent if they match the same set of strings. That is,
their meanings are the same. Therefore we
do not really distinguish between the different regular
expressions $(r_1 + r_2) + r_3$ and $r_1 + (r_2 + r_3)$,
because they are equivalent. I leave you with the question
whether

\[
\ZERO^* \equiv \ONE^*
\]

\noindent holds or not. Such equivalences will be important
for our matching algorithm, because we can use them to
simplify regular expressions, which will mean we can speed
up the calculations.

\subsection*{My Fascination for Regular Expressions}

Up until a few years ago I was not really interested in regular
expressions. They have been studied for the last 60 years (by smarter
people than me)---surely nothing new can be found out about them. I
even have the vague recollection that I did not quite understand them
during my undergraduate study.
If I remember correctly,\footnote{That was really a long time ago.} I
got utterly confused about $\ONE$ (which my lecturer wrote as
$\epsilon$) and the empty string (which he also wrote as
$\epsilon$).\footnote{Obviously the lecturer must have been bad ;o)}
Since then, I have used regular expressions for implementing lexers
and parsers, as I have always been interested in all kinds of
programming languages and compilers, which invariably need regular
expressions in some form or shape.

To understand my fascination \emph{nowadays} with regular expressions,
you need to know that my main scientific interest for the last 17
years has been with theorem provers. I am a core developer of one of
them.\footnote{\url{http://isabelle.in.tum.de}} Theorem provers are
systems in which you can formally reason about mathematical concepts,
but also about programs. In this way theorem provers can help with the
menacing problem of writing bug-free code. Theorem provers have
already proved their value in a number of cases (even in terms of hard
cash), but they are still clunky and difficult to use for average
programmers.

Anyway, in about 2011 I came across the notion of \defn{derivatives of
regular expressions}. This notion allows one to do almost all
calculations with regular expressions on the level of regular
expressions, not needing any automata (you will see we only touch
briefly on automata in lecture 3). Automata are usually the main
object of study in formal language courses. The avoidance of automata
is crucial for me because automata are graphs and it is rather
difficult to reason about graphs in theorem provers. In contrast,
reasoning about regular expressions is easy-peasy in theorem provers.
Is this important?
I think yes, because according to Kuklewicz nearly all POSIX-based
regular expression matchers are
buggy.\footnote{\url{http://www.haskell.org/haskellwiki/Regex_Posix}}
With my PhD student Fahad Ausaf I proved the correctness of one such
matcher that was proposed by Sulzmann and Lu in
2014.\footnote{\url{http://goo.gl/bz0eHp}} Hopefully we can also prove
that the machine code(!) that implements this code efficiently is
correct. Writing programs in this way does not leave any room for
potential errors or bugs. How nice!

What also helped with my fascination with regular expressions is that
we could indeed find out new things about them that have surprised
some experts. Together with two colleagues from China, I was able to
prove the Myhill-Nerode theorem by only using regular expressions and
the notion of derivatives. Earlier versions of this theorem always
used automata in the proof. Using this theorem we can show that
regular languages are closed under complementation, something which
Bill Gasarch in his Computational Complexity
blog\footnote{\url{http://goo.gl/2R11Fw}} assumed can only be shown
via automata. So even somebody who has written a 700+-page
book\footnote{\url{http://goo.gl/fD0eHx}} on regular expressions did
not know better. Well, we showed it can also be done with regular
expressions
only.\footnote{\url{http://nms.kcl.ac.uk/christian.urban/Publications/posix.pdf}}
What a feeling when you are an outsider to the subject!

To conclude: Despite my early ignorance about regular expressions, I
find them now extremely interesting. They have practical importance
(remember the shocking runtime of the regular expression matchers in
Python, Ruby and Java in some instances and the problems in Stack
Exchange and the Atom editor). They are used in tools like Snort and
Zeek in order to monitor network traffic. They have a beautiful
mathematical theory behind them, which can sometimes be quite deep and
which sometimes contains hidden snares.
People who are not very familiar with the mathematical background of
regular expressions get them consistently wrong (this is surprising
given they are supposed to be a core skill for computer scientists).
The hope is that we can do better in the future---for example by
proving that the algorithms actually satisfy their specification and
that the corresponding implementations do not contain any bugs. We are
close, but not yet quite there.

Notwithstanding my fascination, I am also happy to admit that regular
expressions have their shortcomings. There are some well-known
``theoretical'' shortcomings, for example recognising strings of the
form $a^{n}b^{n}$ is not possible with regular expressions. This
means, for example, that recognising whether parentheses are
well-nested in an expression is impossible with (basic) regular
expressions. I am not so bothered by these shortcomings. What I am
bothered about is when regular expressions are in the way of practical
programming. For example, it turns out that the regular expression for
email addresses shown in \eqref{email} is hopelessly inadequate for
recognising all of them (despite being touted as something every
computer scientist should know about). The W3 Consortium (which
standardises the Web) proposes to use the following, already more
complicated regular expression for email addresses:

{\small\begin{lstlisting}[language={},keywordstyle=\color{black},numbers=none]
[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*
\end{lstlisting}}

\noindent But they admit that by using this regular expression they
wilfully violate the RFC 5322 standard, which specifies the syntax of
email addresses. With their proposed regular expression they are too
strict in some cases and too lax in others\ldots{}not a good situation
to be in. A regular expression that is claimed to be closer to the
standard is shown in Figure~\ref{monster}.
Whether this claim is true or not, I would not know---the only thing I
can say about this regular expression is that it is a monstrosity.
However, this might actually be an argument against the RFC standard,
rather than against regular expressions. A similar argument is made in

\begin{center}
\url{http://elliot.land/post/its-impossible-to-validate-an-email-address}
\end{center}

\noindent which explains some of the crazier parts of email addresses.
Still it is good to know that some tasks in text processing just
cannot be achieved by using regular expressions. But for what we want
to use them (lexing) they are pretty good.\medskip

\noindent
Finally there is a joke about regular expressions:

\begin{quote}\it
``Sometimes you have a programming problem and it seems like the best
solution is to use regular expressions; now you have two problems.''
\end{quote}

\begin{figure}[p]\small
  \lstinputlisting[numbers=left,linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}]
  {../progs/crawler1.scala}
\caption{The Scala code for a simple web-crawler that checks
for broken links in web-pages. It uses the regular expression
\texttt{http\_pattern} in Line~\ref{httpline} for recognising
URL-addresses. It finds all links using the library function
\texttt{findAllIn} in Line~\ref{findallline} (this function is
part of Scala's regular expression library).\label{crawler1}}
\end{figure}

%\begin{figure}[p]\small
%  \lstinputlisting[numbers=left,linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}]
%  {../progs/crawler2.scala}
%
%\caption{A version of the web-crawler that only follows links
%in ``my'' domain---since these are the ones I am interested in
%to fix. It uses the regular expression \texttt{my\_urls} in
%Line~\ref{myurlline} to check for my name in the links.
%The main change is in
%Lines~\ref{changestartline}--\ref{changeendline} where there
%is a test whether URL is in ``my'' domain or
%not.\label{crawler2}}
%\end{figure}

\begin{figure}[p]\small
  \lstinputlisting[numbers=left,linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}]
  {../progs/crawler2.scala}
\caption{A small email harvester---whenever we download a
web-page, we also check whether it contains any email
addresses. For this we use the regular expression
\texttt{email\_pattern} in Line~\ref{emailline}. The main
change is in Line~\ref{mainline} where all email addresses
that can be found in a page are printed.\label{crawler3}}
\end{figure}

\begin{figure}[p]\tiny
\begin{center}
\begin{minipage}{0.8\textwidth}
\lstinputlisting[language={},keywordstyle=\color{black},numbers=none]{../progs/email-rexp}
\end{minipage}
\end{center}
\caption{Nothing much can be said about this regular
expression\ldots{}except that it is a monstrosity.\label{monster}}
\end{figure}

\end{document}

%%% Local Variables:
%%% mode: latex
%%% TeX-master: t
%%% End: