30 % compiler explorer  | 
    30 % compiler explorer  | 
    31 % https://gcc.godbolt.org  | 
    31 % https://gcc.godbolt.org  | 
    32   | 
    32   | 
    33   | 
    33   | 
    34 \begin{document} | 
    34 \begin{document} | 
    35 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016} | 
    35 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017} | 
    36   | 
    36   | 
    37 \section*{Handout 1} | 
    37 \section*{Handout 1} | 
    38   | 
    38   | 
    39 This module is about text processing, be it for web-crawlers,  | 
    39 This module is about text processing, be it for web-crawlers,  | 
    40 compilers, dictionaries, DNA-data and so on. When looking for  | 
    40 compilers, dictionaries, DNA-data and so on. When looking for a  | 
    41 a particular string, like $abc$ in a large text we can use the  | 
    41 particular string, like $abc$ in a large text we can use the  | 
    42 Knuth-Morris-Pratt algorithm, which is currently the most  | 
    42 Knuth-Morris-Pratt algorithm, which is currently the most efficient  | 
    43 efficient general string search algorithm. But often we do  | 
    43 general string search algorithm. But often we do \emph{not} just look | 
    44 \emph{not} just look for a particular string, but for string | 
    44 for a particular string, but for string patterns. For example in  | 
    45 patterns. For example in program code we need to identify what  | 
    45 program code we need to identify what are the keywords (if, then,  | 
    46 are the keywords, what are the identifiers etc. A pattern for  | 
    46 while, etc), what are the identifiers (variable names). A pattern for  | 
    47 identifiers could be stated as: they start with a letter,  | 
    47 identifiers could be stated as: they start with a letter, followed by  | 
    48 followed by zero or more letters, numbers and underscores.  | 
    48 zero or more letters, numbers and underscores.  Also often we face the  | 
    49 Also often we face the problem that we are given a string (for  | 
    49 problem that we are given a string (for example some user input) and  | 
    50 example some user input) and want to know whether it matches a  | 
    50 want to know whether it matches a particular pattern---be it an email  | 
    51 particular pattern---be it an email address, for example. In  | 
    51 address, for example. In this way we can exclude user input that would  | 
    52 this way we can exclude user input that would otherwise have  | 
    52 otherwise have nasty effects on our program (crashing it or making it  | 
    53 nasty effects on our program (crashing it or making it go into  | 
    53 go into an infinite loop, if not worse). The point is that the fast  | 
    54 an infinite loop, if not worse).\smallskip  | 
    54 Knuth-Morris-Pratt algorithm for strings is not good enough for such  | 
         | 
    55 string patterns.\smallskip  | 
    55   | 
    56   | 
    56 \defn{Regular expressions} help with conveniently specifying | 
    57 \defn{Regular expressions} help with conveniently specifying | 
    57 such patterns. The idea behind regular expressions is that  | 
    58 such patterns. The idea behind regular expressions is that  | 
    58 they are a simple method for describing languages (or sets of  | 
    59 they are a simple method for describing languages (or sets of  | 
    59 strings)\ldots at least languages we are interested in in  | 
    60 strings)\ldots at least languages we are interested in in  | 
    70 \end{equation} | 
    71 \end{equation} | 
    71   | 
    72   | 
    72 \noindent where the first part, the user name, matches one or more lowercase  | 
    73 \noindent where the first part, the user name, matches one or more lowercase  | 
    73 letters (\pcode{a-z}), digits (\pcode{0-9}), underscores, dots | 
    74 letters (\pcode{a-z}), digits (\pcode{0-9}), underscores, dots | 
    74 and hyphens. The \pcode{+} at the end of the brackets ensures | 
    75 and hyphens. The \pcode{+} at the end of the brackets ensures | 
    75 the ``one or more''. Then comes the \pcode{@}-sign, followed | 
    76 the ``one or more''. Then comes the email \pcode{@}-sign, followed | 
    76 by the domain name which must be one or more lowercase  | 
    77 by the domain name which must be one or more lowercase  | 
    77 letters, digits, underscores, dots or hyphens. Note there  | 
    78 letters, digits, underscores, dots or hyphens. Note there  | 
    78 cannot be an underscore in the domain name. Finally there must  | 
    79 cannot be an underscore in the domain name. Finally there must  | 
    79 be a dot followed by the toplevel domain. This toplevel domain  | 
    80 be a dot followed by the toplevel domain. This toplevel domain  | 
    80 must be 2 to 6 lowercase letters including the dot. Example  | 
    81 must be 2 to 6 lowercase letters including the dot. Example  | 
    94 \begin{lstlisting}[language={},numbers=none,keywordstyle=\color{black}] | 
    95 \begin{lstlisting}[language={},numbers=none,keywordstyle=\color{black}] | 
    95 user@localserver  | 
    96 user@localserver  | 
    96 disposable.style.email.with+symbol@example.com  | 
    97 disposable.style.email.with+symbol@example.com  | 
    97 \end{lstlisting} | 
    98 \end{lstlisting} | 
    98   | 
    99   | 
    99 \noindent according to the regular expression we specified in  | 
   100 \noindent according to the regular expression we specified in line  | 
   100 \eqref{email}. Whether this is intended or not is a different | 
   101 \eqref{email} above. Whether this is intended or not is a different | 
   101 question (the second email above is actually an acceptable  | 
   102 question (the second email above is actually an acceptable email  | 
   102 email address acording to the RFC 5322 standard for email  | 
   103 address acording to the RFC 5322 standard for email addresses).  | 
   103 addresses).  | 
         | 
   104    | 
   104    | 
   105 As mentioned above, identifiers, or variables, in program code  | 
   105 As mentioned above, identifiers, or variables, in program code  | 
   106 are often required to satisfy the constraints that they start  | 
   106 are often required to satisfy the constraints that they start  | 
   107 with a letter and then can be followed by zero or more letters  | 
   107 with a letter and then can be followed by zero or more letters  | 
   108 or numbers and also can include underscores, but not as the  | 
   108 or numbers and also can include underscores, but not as the  | 
   176 leave it to you to ponder whether this regular expression  | 
   176 leave it to you to ponder whether this regular expression  | 
   177 really captures all possible web-addresses.  | 
   177 really captures all possible web-addresses.  | 
   178   | 
   178   | 
   179 \subsection*{Why Study Regular Expressions?} | 
   179 \subsection*{Why Study Regular Expressions?} | 
   180   | 
   180   | 
   181 Regular expressions were introduced by Kleene in the 1950ies  | 
   181 Regular expressions were introduced by Kleene in the 1950ies and they  | 
   182 and they have been object of intense study since then. They  | 
   182 have been object of intense study since then. They are nowadays pretty  | 
   183 are nowadays pretty much ubiquitous in computer science. There  | 
   183 much ubiquitous in computer science. There are many libraries  | 
   184 are many libraries implementing regular expressions. I am sure  | 
   184 implementing regular expressions. I am sure you have come across them  | 
   185 you have come across them before (remember PRA?). Why on earth  | 
   185 before (remember the PRA module?). Why on earth then is there any  | 
   186 then is there any interest in studying them again in depth in  | 
   186 interest in studying them again in depth in this module? Well, one  | 
   187 this module? Well, one answer is in the following two graphs about  | 
   187 answer is in the following two graphs about regular expression  | 
   188 regular expression matching in Python, Ruby and Java.  | 
   188 matching in Python, Ruby and Java.  | 
   189   | 
   189   | 
   190 \begin{center} | 
   190 \begin{center} | 
   191 \begin{tabular}{@{\hspace{-1mm}}c@{\hspace{-1mm}}c@{}} | 
   191 \begin{tabular}{@{\hspace{-1mm}}c@{\hspace{-1mm}}c@{}} | 
   192 \begin{tikzpicture} | 
   192 \begin{tikzpicture} | 
   193 \begin{axis}[ | 
   193 \begin{axis}[ | 
   567 simplify regular expressions, which will mean we can speed  | 
   567 simplify regular expressions, which will mean we can speed  | 
   568 up the calculations.   | 
   568 up the calculations.   | 
   569   | 
   569   | 
   570 \subsection*{My Fascination for Regular Expressions} | 
   570 \subsection*{My Fascination for Regular Expressions} | 
   571   | 
   571   | 
   572 Up until a few years ago I was not really interested in  | 
   572 Up until a few years ago I was not really interested in regular  | 
   573 regular expressions. They have been studied for the last 60  | 
   573 expressions. They have been studied for the last 60 years (by smarter  | 
   574 years (by smarter people than me)---surely nothing new can be  | 
   574 people than me)---surely nothing new can be found out about them. I  | 
   575 found out about them. I even have the vague recollection that  | 
   575 even have the vague recollection that I did not quite understand them  | 
   576 I did not quite understand them during my undergraduate study. If I remember  | 
   576 during my undergraduate study. If I remember correctly,\footnote{That | 
   577 correctly,\footnote{That was really a long time ago.} I got | 
   577   was really a long time ago.} I got utterly confused about $\ONE$  | 
   578 utterly confused about $\ONE$ (which my lecturer wrote as  | 
   578 (which my lecturer wrote as $\epsilon$) and the empty string (which he  | 
   579 $\epsilon$) and the empty string.\footnote{Obviously the | 
   579 also wrote as $\epsilon$).\footnote{Obviously the lecturer must have | 
   580 lecturer must have been bad.} Since my then, I have used  | 
   580   been bad ;o)} Since my then, I have used regular expressions for  | 
   581 regular expressions for implementing lexers and parsers as I  | 
   581 implementing lexers and parsers as I have always been interested in  | 
   582 have always been interested in all kinds of programming  | 
   582 all kinds of programming languages and compilers, which invariably  | 
   583 languages and compilers, which invariably need regular  | 
   583 need regular expressions in some form or shape.  | 
   584 expressions in some form or shape.   | 
         | 
   585   | 
   584   | 
   586 To understand my fascination \emph{nowadays} with regular | 
   585 To understand my fascination \emph{nowadays} with regular | 
   587 expressions, you need to know that my main scientific interest  | 
   586 expressions, you need to know that my main scientific interest  | 
   588 for the last 14 years has been with theorem provers. I am a  | 
   587 for the last 17 years has been with theorem provers. I am a  | 
   589 core developer of one of  | 
   588 core developer of one of  | 
   590 them.\footnote{\url{http://isabelle.in.tum.de}} Theorem | 
   589 them.\footnote{\url{http://isabelle.in.tum.de}} Theorem | 
   591 provers are systems in which you can formally reason about  | 
   590 provers are systems in which you can formally reason about  | 
   592 mathematical concepts, but also about programs. In this way  | 
   591 mathematical concepts, but also about programs. In this way  | 
   593 theorem prover can help with the manacing problem of writing bug-free code. Theorem provers have  | 
   592 theorem provers can help with the manacing problem of writing bug-free code. Theorem provers have  | 
   594 proved already their value in a number of cases (even in  | 
   593 proved already their value in a number of cases (even in  | 
   595 terms of hard cash), but they are still clunky and difficult  | 
   594 terms of hard cash), but they are still clunky and difficult  | 
   596 to use by average programmers.  | 
   595 to use by average programmers.  | 
   597   | 
   596   | 
   598 Anyway, in about 2011 I came across the notion of  | 
   597 Anyway, in about 2011 I came across the notion of \defn{derivatives of | 
   599 \defn{derivatives of regular expressions}. This notion allows | 
   598   regular expressions}. This notion allows one to do almost all  | 
   600 one to do almost all calculations in regular language theory  | 
   599 calculations with regular expressions on the level of regular  | 
   601 on the level of regular expressions, not needing any automata (you will  | 
   600 expressions, not needing any automata (you will see we only touch  | 
   602 see we only touch briefly on automata in lecture 3).  | 
   601 briefly on automata in lecture 3). Automata are usually the main  | 
   603 This is crucial because automata are graphs and it is rather  | 
   602 object of study in formal language courses.  The avoidance of automata  | 
   604 difficult to reason about graphs in theorem provers. In  | 
   603 is crucial because automata are graphs and it is rather difficult to  | 
   605 contrast, reasoning about regular expressions is easy-peasy in  | 
   604 reason about graphs in theorem provers. In contrast, reasoning about  | 
   606 theorem provers. Is this important? I think yes, because  | 
   605 regular expressions is easy-peasy in theorem provers. Is this  | 
   607 according to Kuklewicz nearly all POSIX-based regular  | 
   606 important? I think yes, because according to Kuklewicz nearly all  | 
   608 expression matchers are  | 
   607 POSIX-based regular expression matchers are  | 
   609 buggy.\footnote{\url{http://www.haskell.org/haskellwiki/Regex_Posix}} | 
   608 buggy.\footnote{\url{http://www.haskell.org/haskellwiki/Regex_Posix}} | 
   610 With my PhD student Fahad Ausaf I proved  | 
   609 With my PhD student Fahad Ausaf I proved the correctness for one such  | 
   611 the correctness for one such matcher that was  | 
   610 matcher that was proposed by Sulzmann and Lu in  | 
   612 proposed by Sulzmann and Lu in  | 
   611 2014.\footnote{\url{http://goo.gl/bz0eHp}} Hopefully we can prove that | 
   613 2014.\footnote{\url{http://goo.gl/bz0eHp}} Hopefully we can | 
   612 the machine code(!)  that implements this code efficiently is correct  | 
   614 prove that the machine code(!)  | 
   613 also. Writing programs in this way does not leave any room for  | 
   615 that implements this code efficiently is correct also. Writing  | 
   614 potential errors or bugs. How nice!  | 
   616 programs in this way does not leave any room for potential  | 
         | 
   617 errors or bugs. How nice!  | 
         | 
   618   | 
   615   | 
   619 What also helped with my fascination with regular expressions  | 
   616 What also helped with my fascination with regular expressions  | 
   620 is that we could indeed find out new things about them that  | 
   617 is that we could indeed find out new things about them that  | 
   621 have surprised some experts. Together with two colleagues from China, I was  | 
   618 have surprised some experts. Together with two colleagues from China, I was  | 
   622 able to prove the Myhill-Nerode theorem by only using regular  | 
   619 able to prove the Myhill-Nerode theorem by only using regular  | 
   627 blog\footnote{\url{http://goo.gl/2R11Fw}} assumed can only be | 
   624 blog\footnote{\url{http://goo.gl/2R11Fw}} assumed can only be | 
   628 shown via automata. Even sombody who has written a 700+-page  | 
   625 shown via automata. Even sombody who has written a 700+-page  | 
   629 book\footnote{\url{http://goo.gl/fD0eHx}} on regular | 
   626 book\footnote{\url{http://goo.gl/fD0eHx}} on regular | 
   630 exprssions did not know better. Well, we showed it can also be  | 
   627 exprssions did not know better. Well, we showed it can also be  | 
   631 done with regular expressions only.\footnote{\url{http://www.inf.kcl.ac.uk/staff/urbanc/Publications/rexp.pdf}}  | 
   628 done with regular expressions only.\footnote{\url{http://www.inf.kcl.ac.uk/staff/urbanc/Publications/rexp.pdf}}  | 
   632 What a feeling if you are an outsider to the subject!  | 
   629 What a feeling when you are an outsider to the subject!  | 
   633   | 
   630   | 
   634 To conclude: Despite my early ignorance about regular  | 
   631 To conclude: Despite my early ignorance about regular expressions, I  | 
   635 expressions, I find them now very interesting. They have a  | 
   632 find them now very interesting. They have a beautiful mathematical  | 
   636 beautiful mathematical theory behind them, which can be  | 
   633 theory behind them, which can be sometimes quite deep and which  | 
   637 sometimes quite deep and which sometimes contains hidden snares. They have  | 
   634 sometimes contains hidden snares. They have practical importance  | 
   638 practical importance (remember the shocking runtime of the  | 
   635 (remember the shocking runtime of the regular expression matchers in  | 
   639 regular expression matchers in Python, Ruby and Java in some  | 
   636 Python, Ruby and Java in some instances and the problems in Stack  | 
   640 instances and the problems in Stack Exchange and the Atom editor).   | 
   637 Exchange and the Atom editor).  People who are not very familiar with  | 
   641 People who are not very familiar with the  | 
   638 the mathematical background of regular expressions get them  | 
   642 mathematical background of regular expressions get them  | 
   639 consistently wrong (surprising given they are a supposed to be core  | 
   643 consistently wrong. The hope is that we can do better in the  | 
   640 skill for computer scientists). The hope is that we can do better in  | 
   644 future---for example by proving that the algorithms actually  | 
   641 the future---for example by proving that the algorithms actually  | 
   645 satisfy their specification and that the corresponding  | 
   642 satisfy their specification and that the corresponding implementations  | 
   646 implementations do not contain any bugs. We are close, but not  | 
   643 do not contain any bugs. We are close, but not yet quite there.  | 
   647 yet quite there.  | 
         | 
   648   | 
   644   | 
   649 Notwithstanding my fascination, I am also happy to admit that regular  | 
   645 Notwithstanding my fascination, I am also happy to admit that regular  | 
   650 expressions have their shortcomings. There are some well-known  | 
   646 expressions have their shortcomings. There are some well-known  | 
   651 ``theoretical'' shortcomings, for example recognising strings  | 
   647 ``theoretical'' shortcomings, for example recognising strings of the  | 
   652 of the form $a^{n}b^{n}$. I am not so bothered by them. What I | 
   648 form $a^{n}b^{n}$ is not possible with regular expressions. This means | 
   653 am bothered about is when regular expressions are in the way  | 
   649 for example if we try to regognise whether parentheses are well-nested  | 
   654 of practical programming. For example, it turns out that the  | 
   650 is impossible with (basic) regular expressions.  I am not so bothered  | 
   655 regular expression for email addresses shown in \eqref{email} | 
   651 by these shortcomings. What I am bothered about is when regular  | 
   656 is hopelessly inadequate for recognising all of them (despite  | 
   652 expressions are in the way of practical programming. For example, it  | 
   657 being touted as something every computer scientist should know  | 
   653 turns out that the regular expression for email addresses shown in  | 
   658 about). The W3 Consortium (which standardises the Web)  | 
   654 \eqref{email} is hopelessly inadequate for recognising all of them | 
   659 proposes to use the following, already more complicated  | 
   655 (despite being touted as something every computer scientist should  | 
   660 regular expressions for email addresses:  | 
   656 know about). The W3 Consortium (which standardises the Web) proposes  | 
         | 
   657 to use the following, already more complicated regular expressions for  | 
         | 
   658 email addresses:  | 
   661   | 
   659   | 
   662 {\small\begin{lstlisting}[language={},keywordstyle=\color{black},numbers=none] | 
   660 {\small\begin{lstlisting}[language={},keywordstyle=\color{black},numbers=none] | 
   663 [a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)* | 
   661 [a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)* | 
   664 \end{lstlisting}} | 
   662 \end{lstlisting}} | 
   665   | 
   663   | 
   666 \noindent But they admit that by using this regular expression  | 
   664 \noindent But they admit that by using this regular expression  | 
   667 they wilfully violate the RFC 5322 standard which specifies  | 
   665 they wilfully violate the RFC 5322 standard, which specifies  | 
   668 the syntax of email addresses. With their proposed regular  | 
   666 the syntax of email addresses. With their proposed regular  | 
   669 expression they are too strict in some cases and too lax in  | 
   667 expression they are too strict in some cases and too lax in  | 
   670 others. Not a good situation to be in. A regular expression  | 
   668 others. Not a good situation to be in. A regular expression  | 
   671 that is claimed to be closer to the standard is shown in  | 
   669 that is claimed to be closer to the standard is shown in  | 
   672 Figure~\ref{monster}. Whether this claim is true or not, I | 
   670 Figure~\ref{monster}. Whether this claim is true or not, I |