30 % compiler explorer |
30 % compiler explorer |
31 % https://gcc.godbolt.org |
31 % https://gcc.godbolt.org |
32 |
32 |
33 |
33 |
34 \begin{document} |
34 \begin{document} |
35 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016} |
35 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017} |
36 |
36 |
37 \section*{Handout 1} |
37 \section*{Handout 1} |
38 |
38 |
39 This module is about text processing, be it for web-crawlers, |
39 This module is about text processing, be it for web-crawlers, |
40 compilers, dictionaries, DNA-data and so on. When looking for |
40 compilers, dictionaries, DNA-data and so on. When looking for a |
41 a particular string, like $abc$ in a large text we can use the |
41 particular string, like $abc$ in a large text we can use the |
42 Knuth-Morris-Pratt algorithm, which is currently the most |
42 Knuth-Morris-Pratt algorithm, which is currently the most efficient |
43 efficient general string search algorithm. But often we do |
43 general string search algorithm. But often we do \emph{not} just look |
44 \emph{not} just look for a particular string, but for string |
44 for a particular string, but for string patterns. For example in |
45 patterns. For example in program code we need to identify what |
45 program code we need to identify what are the keywords (if, then, |
46 are the keywords, what are the identifiers etc. A pattern for |
46 while, etc), what are the identifiers (variable names). A pattern for |
47 identifiers could be stated as: they start with a letter, |
47 identifiers could be stated as: they start with a letter, followed by |
48 followed by zero or more letters, numbers and underscores. |
48 zero or more letters, numbers and underscores. Also often we face the |
49 Also often we face the problem that we are given a string (for |
49 problem that we are given a string (for example some user input) and |
50 example some user input) and want to know whether it matches a |
50 want to know whether it matches a particular pattern---be it an email |
51 particular pattern---be it an email address, for example. In |
51 address, for example. In this way we can exclude user input that would |
52 this way we can exclude user input that would otherwise have |
52 otherwise have nasty effects on our program (crashing it or making it |
53 nasty effects on our program (crashing it or making it go into |
53 go into an infinite loop, if not worse). The point is that the fast |
54 an infinite loop, if not worse).\smallskip |
54 Knuth-Morris-Pratt algorithm for strings is not good enough for such |
|
55 string patterns.\smallskip |
55 |
56 |
56 \defn{Regular expressions} help with conveniently specifying |
57 \defn{Regular expressions} help with conveniently specifying |
57 such patterns. The idea behind regular expressions is that |
58 such patterns. The idea behind regular expressions is that |
58 they are a simple method for describing languages (or sets of |
59 they are a simple method for describing languages (or sets of |
59 strings)\ldots at least languages we are interested in in |
60 strings)\ldots at least languages we are interested in in |
70 \end{equation} |
71 \end{equation} |
71 |
72 |
72 \noindent where the first part, the user name, matches one or more lowercase |
73 \noindent where the first part, the user name, matches one or more lowercase |
73 letters (\pcode{a-z}), digits (\pcode{0-9}), underscores, dots |
74 letters (\pcode{a-z}), digits (\pcode{0-9}), underscores, dots |
74 and hyphens. The \pcode{+} at the end of the brackets ensures |
75 and hyphens. The \pcode{+} at the end of the brackets ensures |
75 the ``one or more''. Then comes the \pcode{@}-sign, followed |
76 the ``one or more''. Then comes the email \pcode{@}-sign, followed |
76 by the domain name which must be one or more lowercase |
77 by the domain name which must be one or more lowercase |
77 letters, digits, underscores, dots or hyphens. Note there |
78 letters, digits, underscores, dots or hyphens. Note there |
78 cannot be an underscore in the domain name. Finally there must |
79 cannot be an underscore in the domain name. Finally there must |
79 be a dot followed by the toplevel domain. This toplevel domain |
80 be a dot followed by the toplevel domain. This toplevel domain |
80 must be 2 to 6 lowercase letters including the dot. Example |
81 must be 2 to 6 lowercase letters including the dot. Example |
94 \begin{lstlisting}[language={},numbers=none,keywordstyle=\color{black}] |
95 \begin{lstlisting}[language={},numbers=none,keywordstyle=\color{black}] |
95 user@localserver |
96 user@localserver |
96 disposable.style.email.with+symbol@example.com |
97 disposable.style.email.with+symbol@example.com |
97 \end{lstlisting} |
98 \end{lstlisting} |
98 |
99 |
99 \noindent according to the regular expression we specified in |
100 \noindent according to the regular expression we specified in line |
100 \eqref{email}. Whether this is intended or not is a different |
101 \eqref{email} above. Whether this is intended or not is a different |
101 question (the second email above is actually an acceptable |
102 question (the second email above is actually an acceptable email |
102 email address acording to the RFC 5322 standard for email |
103 address acording to the RFC 5322 standard for email addresses). |
103 addresses). |
|
104 |
104 |
105 As mentioned above, identifiers, or variables, in program code |
105 As mentioned above, identifiers, or variables, in program code |
106 are often required to satisfy the constraints that they start |
106 are often required to satisfy the constraints that they start |
107 with a letter and then can be followed by zero or more letters |
107 with a letter and then can be followed by zero or more letters |
108 or numbers and also can include underscores, but not as the |
108 or numbers and also can include underscores, but not as the |
176 leave it to you to ponder whether this regular expression |
176 leave it to you to ponder whether this regular expression |
177 really captures all possible web-addresses. |
177 really captures all possible web-addresses. |
178 |
178 |
179 \subsection*{Why Study Regular Expressions?} |
179 \subsection*{Why Study Regular Expressions?} |
180 |
180 |
181 Regular expressions were introduced by Kleene in the 1950ies |
181 Regular expressions were introduced by Kleene in the 1950ies and they |
182 and they have been object of intense study since then. They |
182 have been object of intense study since then. They are nowadays pretty |
183 are nowadays pretty much ubiquitous in computer science. There |
183 much ubiquitous in computer science. There are many libraries |
184 are many libraries implementing regular expressions. I am sure |
184 implementing regular expressions. I am sure you have come across them |
185 you have come across them before (remember PRA?). Why on earth |
185 before (remember the PRA module?). Why on earth then is there any |
186 then is there any interest in studying them again in depth in |
186 interest in studying them again in depth in this module? Well, one |
187 this module? Well, one answer is in the following two graphs about |
187 answer is in the following two graphs about regular expression |
188 regular expression matching in Python, Ruby and Java. |
188 matching in Python, Ruby and Java. |
189 |
189 |
190 \begin{center} |
190 \begin{center} |
191 \begin{tabular}{@{\hspace{-1mm}}c@{\hspace{-1mm}}c@{}} |
191 \begin{tabular}{@{\hspace{-1mm}}c@{\hspace{-1mm}}c@{}} |
192 \begin{tikzpicture} |
192 \begin{tikzpicture} |
193 \begin{axis}[ |
193 \begin{axis}[ |
567 simplify regular expressions, which will mean we can speed |
567 simplify regular expressions, which will mean we can speed |
568 up the calculations. |
568 up the calculations. |
569 |
569 |
570 \subsection*{My Fascination for Regular Expressions} |
570 \subsection*{My Fascination for Regular Expressions} |
571 |
571 |
572 Up until a few years ago I was not really interested in |
572 Up until a few years ago I was not really interested in regular |
573 regular expressions. They have been studied for the last 60 |
573 expressions. They have been studied for the last 60 years (by smarter |
574 years (by smarter people than me)---surely nothing new can be |
574 people than me)---surely nothing new can be found out about them. I |
575 found out about them. I even have the vague recollection that |
575 even have the vague recollection that I did not quite understand them |
576 I did not quite understand them during my undergraduate study. If I remember |
576 during my undergraduate study. If I remember correctly,\footnote{That |
577 correctly,\footnote{That was really a long time ago.} I got |
577 was really a long time ago.} I got utterly confused about $\ONE$ |
578 utterly confused about $\ONE$ (which my lecturer wrote as |
578 (which my lecturer wrote as $\epsilon$) and the empty string (which he |
579 $\epsilon$) and the empty string.\footnote{Obviously the |
579 also wrote as $\epsilon$).\footnote{Obviously the lecturer must have |
580 lecturer must have been bad.} Since my then, I have used |
580 been bad ;o)} Since my then, I have used regular expressions for |
581 regular expressions for implementing lexers and parsers as I |
581 implementing lexers and parsers as I have always been interested in |
582 have always been interested in all kinds of programming |
582 all kinds of programming languages and compilers, which invariably |
583 languages and compilers, which invariably need regular |
583 need regular expressions in some form or shape. |
584 expressions in some form or shape. |
|
585 |
584 |
586 To understand my fascination \emph{nowadays} with regular |
585 To understand my fascination \emph{nowadays} with regular |
587 expressions, you need to know that my main scientific interest |
586 expressions, you need to know that my main scientific interest |
588 for the last 14 years has been with theorem provers. I am a |
587 for the last 17 years has been with theorem provers. I am a |
589 core developer of one of |
588 core developer of one of |
590 them.\footnote{\url{http://isabelle.in.tum.de}} Theorem |
589 them.\footnote{\url{http://isabelle.in.tum.de}} Theorem |
591 provers are systems in which you can formally reason about |
590 provers are systems in which you can formally reason about |
592 mathematical concepts, but also about programs. In this way |
591 mathematical concepts, but also about programs. In this way |
593 theorem prover can help with the manacing problem of writing bug-free code. Theorem provers have |
592 theorem provers can help with the manacing problem of writing bug-free code. Theorem provers have |
594 proved already their value in a number of cases (even in |
593 proved already their value in a number of cases (even in |
595 terms of hard cash), but they are still clunky and difficult |
594 terms of hard cash), but they are still clunky and difficult |
596 to use by average programmers. |
595 to use by average programmers. |
597 |
596 |
598 Anyway, in about 2011 I came across the notion of |
597 Anyway, in about 2011 I came across the notion of \defn{derivatives of |
599 \defn{derivatives of regular expressions}. This notion allows |
598 regular expressions}. This notion allows one to do almost all |
600 one to do almost all calculations in regular language theory |
599 calculations with regular expressions on the level of regular |
601 on the level of regular expressions, not needing any automata (you will |
600 expressions, not needing any automata (you will see we only touch |
602 see we only touch briefly on automata in lecture 3). |
601 briefly on automata in lecture 3). Automata are usually the main |
603 This is crucial because automata are graphs and it is rather |
602 object of study in formal language courses. The avoidance of automata |
604 difficult to reason about graphs in theorem provers. In |
603 is crucial because automata are graphs and it is rather difficult to |
605 contrast, reasoning about regular expressions is easy-peasy in |
604 reason about graphs in theorem provers. In contrast, reasoning about |
606 theorem provers. Is this important? I think yes, because |
605 regular expressions is easy-peasy in theorem provers. Is this |
607 according to Kuklewicz nearly all POSIX-based regular |
606 important? I think yes, because according to Kuklewicz nearly all |
608 expression matchers are |
607 POSIX-based regular expression matchers are |
609 buggy.\footnote{\url{http://www.haskell.org/haskellwiki/Regex_Posix}} |
608 buggy.\footnote{\url{http://www.haskell.org/haskellwiki/Regex_Posix}} |
610 With my PhD student Fahad Ausaf I proved |
609 With my PhD student Fahad Ausaf I proved the correctness for one such |
611 the correctness for one such matcher that was |
610 matcher that was proposed by Sulzmann and Lu in |
612 proposed by Sulzmann and Lu in |
611 2014.\footnote{\url{http://goo.gl/bz0eHp}} Hopefully we can prove that |
613 2014.\footnote{\url{http://goo.gl/bz0eHp}} Hopefully we can |
612 the machine code(!) that implements this code efficiently is correct |
614 prove that the machine code(!) |
613 also. Writing programs in this way does not leave any room for |
615 that implements this code efficiently is correct also. Writing |
614 potential errors or bugs. How nice! |
616 programs in this way does not leave any room for potential |
|
617 errors or bugs. How nice! |
|
618 |
615 |
619 What also helped with my fascination with regular expressions |
616 What also helped with my fascination with regular expressions |
620 is that we could indeed find out new things about them that |
617 is that we could indeed find out new things about them that |
621 have surprised some experts. Together with two colleagues from China, I was |
618 have surprised some experts. Together with two colleagues from China, I was |
622 able to prove the Myhill-Nerode theorem by only using regular |
619 able to prove the Myhill-Nerode theorem by only using regular |
627 blog\footnote{\url{http://goo.gl/2R11Fw}} assumed can only be |
624 blog\footnote{\url{http://goo.gl/2R11Fw}} assumed can only be |
628 shown via automata. Even sombody who has written a 700+-page |
625 shown via automata. Even sombody who has written a 700+-page |
629 book\footnote{\url{http://goo.gl/fD0eHx}} on regular |
626 book\footnote{\url{http://goo.gl/fD0eHx}} on regular |
630 exprssions did not know better. Well, we showed it can also be |
627 exprssions did not know better. Well, we showed it can also be |
631 done with regular expressions only.\footnote{\url{http://www.inf.kcl.ac.uk/staff/urbanc/Publications/rexp.pdf}} |
628 done with regular expressions only.\footnote{\url{http://www.inf.kcl.ac.uk/staff/urbanc/Publications/rexp.pdf}} |
632 What a feeling if you are an outsider to the subject! |
629 What a feeling when you are an outsider to the subject! |
633 |
630 |
634 To conclude: Despite my early ignorance about regular |
631 To conclude: Despite my early ignorance about regular expressions, I |
635 expressions, I find them now very interesting. They have a |
632 find them now very interesting. They have a beautiful mathematical |
636 beautiful mathematical theory behind them, which can be |
633 theory behind them, which can be sometimes quite deep and which |
637 sometimes quite deep and which sometimes contains hidden snares. They have |
634 sometimes contains hidden snares. They have practical importance |
638 practical importance (remember the shocking runtime of the |
635 (remember the shocking runtime of the regular expression matchers in |
639 regular expression matchers in Python, Ruby and Java in some |
636 Python, Ruby and Java in some instances and the problems in Stack |
640 instances and the problems in Stack Exchange and the Atom editor). |
637 Exchange and the Atom editor). People who are not very familiar with |
641 People who are not very familiar with the |
638 the mathematical background of regular expressions get them |
642 mathematical background of regular expressions get them |
639 consistently wrong (surprising given they are a supposed to be core |
643 consistently wrong. The hope is that we can do better in the |
640 skill for computer scientists). The hope is that we can do better in |
644 future---for example by proving that the algorithms actually |
641 the future---for example by proving that the algorithms actually |
645 satisfy their specification and that the corresponding |
642 satisfy their specification and that the corresponding implementations |
646 implementations do not contain any bugs. We are close, but not |
643 do not contain any bugs. We are close, but not yet quite there. |
647 yet quite there. |
|
648 |
644 |
649 Notwithstanding my fascination, I am also happy to admit that regular |
645 Notwithstanding my fascination, I am also happy to admit that regular |
650 expressions have their shortcomings. There are some well-known |
646 expressions have their shortcomings. There are some well-known |
651 ``theoretical'' shortcomings, for example recognising strings |
647 ``theoretical'' shortcomings, for example recognising strings of the |
652 of the form $a^{n}b^{n}$. I am not so bothered by them. What I |
648 form $a^{n}b^{n}$ is not possible with regular expressions. This means |
653 am bothered about is when regular expressions are in the way |
649 for example if we try to regognise whether parentheses are well-nested |
654 of practical programming. For example, it turns out that the |
650 is impossible with (basic) regular expressions. I am not so bothered |
655 regular expression for email addresses shown in \eqref{email} |
651 by these shortcomings. What I am bothered about is when regular |
656 is hopelessly inadequate for recognising all of them (despite |
652 expressions are in the way of practical programming. For example, it |
657 being touted as something every computer scientist should know |
653 turns out that the regular expression for email addresses shown in |
658 about). The W3 Consortium (which standardises the Web) |
654 \eqref{email} is hopelessly inadequate for recognising all of them |
659 proposes to use the following, already more complicated |
655 (despite being touted as something every computer scientist should |
660 regular expressions for email addresses: |
656 know about). The W3 Consortium (which standardises the Web) proposes |
|
657 to use the following, already more complicated regular expressions for |
|
658 email addresses: |
661 |
659 |
662 {\small\begin{lstlisting}[language={},keywordstyle=\color{black},numbers=none] |
660 {\small\begin{lstlisting}[language={},keywordstyle=\color{black},numbers=none] |
663 [a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)* |
661 [a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)* |
664 \end{lstlisting}} |
662 \end{lstlisting}} |
665 |
663 |
666 \noindent But they admit that by using this regular expression |
664 \noindent But they admit that by using this regular expression |
667 they wilfully violate the RFC 5322 standard which specifies |
665 they wilfully violate the RFC 5322 standard, which specifies |
668 the syntax of email addresses. With their proposed regular |
666 the syntax of email addresses. With their proposed regular |
669 expression they are too strict in some cases and too lax in |
667 expression they are too strict in some cases and too lax in |
670 others. Not a good situation to be in. A regular expression |
668 others. Not a good situation to be in. A regular expression |
671 that is claimed to be closer to the standard is shown in |
669 that is claimed to be closer to the standard is shown in |
672 Figure~\ref{monster}. Whether this claim is true or not, I |
670 Figure~\ref{monster}. Whether this claim is true or not, I |