handouts/ho01.tex
changeset 471 9476086849ad
parent 452 b93f4d2aeee1
child 473 dc528091eb70
child 503 f2d7b885b3e3
equal deleted inserted replaced
469:1f4e81950ab4 471:9476086849ad
    30 % compiler explorer
    30 % compiler explorer
    31 % https://gcc.godbolt.org
    31 % https://gcc.godbolt.org
    32 
    32 
    33 
    33 
    34 \begin{document}
    34 \begin{document}
    35 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016}
    35 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017}
    36 
    36 
    37 \section*{Handout 1}
    37 \section*{Handout 1}
    38 
    38 
    39 This module is about text processing, be it for web-crawlers,
    39 This module is about text processing, be it for web-crawlers,
    40 compilers, dictionaries, DNA-data and so on. When looking for
    40 compilers, dictionaries, DNA-data and so on. When looking for a
    41 a particular string, like $abc$ in a large text we can use the
    41 particular string, like $abc$ in a large text we can use the
    42 Knuth-Morris-Pratt algorithm, which is currently the most
    42 Knuth-Morris-Pratt algorithm, which is currently the most efficient
    43 efficient general string search algorithm. But often we do
    43 general string search algorithm. But often we do \emph{not} just look
    44 \emph{not} just look for a particular string, but for string
    44 for a particular string, but for string patterns. For example in
    45 patterns. For example in program code we need to identify what
    45 program code we need to identify what are the keywords (if, then,
    46 are the keywords, what are the identifiers etc. A pattern for
    46 while, etc), what are the identifiers (variable names). A pattern for
    47 identifiers could be stated as: they start with a letter,
    47 identifiers could be stated as: they start with a letter, followed by
    48 followed by zero or more letters, numbers and underscores.
    48 zero or more letters, numbers and underscores.  Also often we face the
    49 Also often we face the problem that we are given a string (for
    49 problem that we are given a string (for example some user input) and
    50 example some user input) and want to know whether it matches a
    50 want to know whether it matches a particular pattern---be it an email
    51 particular pattern---be it an email address, for example. In
    51 address, for example. In this way we can exclude user input that would
    52 this way we can exclude user input that would otherwise have
    52 otherwise have nasty effects on our program (crashing it or making it
    53 nasty effects on our program (crashing it or making it go into
    53 go into an infinite loop, if not worse). The point is that the fast
    54 an infinite loop, if not worse).\smallskip
    54 Knuth-Morris-Pratt algorithm for strings is not good enough for such
       
    55 string patterns.\smallskip
    55 
    56 
    56 \defn{Regular expressions} help with conveniently specifying
    57 \defn{Regular expressions} help with conveniently specifying
    57 such patterns. The idea behind regular expressions is that
    58 such patterns. The idea behind regular expressions is that
    58 they are a simple method for describing languages (or sets of
    59 they are a simple method for describing languages (or sets of
    59 strings)\ldots at least languages we are interested in in
    60 strings)\ldots at least languages we are interested in in
    70 \end{equation}
    71 \end{equation}
    71 
    72 
    72 \noindent where the first part, the user name, matches one or more lowercase
    73 \noindent where the first part, the user name, matches one or more lowercase
    73 letters (\pcode{a-z}), digits (\pcode{0-9}), underscores, dots
    74 letters (\pcode{a-z}), digits (\pcode{0-9}), underscores, dots
    74 and hyphens. The \pcode{+} at the end of the brackets ensures
    75 and hyphens. The \pcode{+} at the end of the brackets ensures
    75 the ``one or more''. Then comes the \pcode{@}-sign, followed
    76 the ``one or more''. Then comes the email \pcode{@}-sign, followed
    76 by the domain name which must be one or more lowercase
    77 by the domain name which must be one or more lowercase
    77 letters, digits, underscores, dots or hyphens. Note there
    78 letters, digits, underscores, dots or hyphens. Note there
    78 cannot be an underscore in the domain name. Finally there must
    79 cannot be an underscore in the domain name. Finally there must
    79 be a dot followed by the toplevel domain. This toplevel domain
    80 be a dot followed by the toplevel domain. This toplevel domain
    80 must be 2 to 6 lowercase letters including the dot. Example
    81 must be 2 to 6 lowercase letters including the dot. Example
    94 \begin{lstlisting}[language={},numbers=none,keywordstyle=\color{black}]
    95 \begin{lstlisting}[language={},numbers=none,keywordstyle=\color{black}]
    95 user@localserver
    96 user@localserver
    96 disposable.style.email.with+symbol@example.com
    97 disposable.style.email.with+symbol@example.com
    97 \end{lstlisting}
    98 \end{lstlisting}
    98 
    99 
    99 \noindent according to the regular expression we specified in
   100 \noindent according to the regular expression we specified in line
   100 \eqref{email}. Whether this is intended or not is a different
   101 \eqref{email} above. Whether this is intended or not is a different
   101 question (the second email above is actually an acceptable
   102 question (the second email above is actually an acceptable email
   102 email address acording to the RFC 5322 standard for email
   103 address acording to the RFC 5322 standard for email addresses).
   103 addresses).
       
   104  
   104  
   105 As mentioned above, identifiers, or variables, in program code
   105 As mentioned above, identifiers, or variables, in program code
   106 are often required to satisfy the constraints that they start
   106 are often required to satisfy the constraints that they start
   107 with a letter and then can be followed by zero or more letters
   107 with a letter and then can be followed by zero or more letters
   108 or numbers and also can include underscores, but not as the
   108 or numbers and also can include underscores, but not as the
   176 leave it to you to ponder whether this regular expression
   176 leave it to you to ponder whether this regular expression
   177 really captures all possible web-addresses.
   177 really captures all possible web-addresses.
   178 
   178 
   179 \subsection*{Why Study Regular Expressions?}
   179 \subsection*{Why Study Regular Expressions?}
   180 
   180 
   181 Regular expressions were introduced by Kleene in the 1950ies
   181 Regular expressions were introduced by Kleene in the 1950ies and they
   182 and they have been object of intense study since then. They
   182 have been object of intense study since then. They are nowadays pretty
   183 are nowadays pretty much ubiquitous in computer science. There
   183 much ubiquitous in computer science. There are many libraries
   184 are many libraries implementing regular expressions. I am sure
   184 implementing regular expressions. I am sure you have come across them
   185 you have come across them before (remember PRA?). Why on earth
   185 before (remember the PRA module?). Why on earth then is there any
   186 then is there any interest in studying them again in depth in
   186 interest in studying them again in depth in this module? Well, one
   187 this module? Well, one answer is in the following two graphs about
   187 answer is in the following two graphs about regular expression
   188 regular expression matching in Python, Ruby and Java.
   188 matching in Python, Ruby and Java.
   189 
   189 
   190 \begin{center}
   190 \begin{center}
   191 \begin{tabular}{@{\hspace{-1mm}}c@{\hspace{-1mm}}c@{}}
   191 \begin{tabular}{@{\hspace{-1mm}}c@{\hspace{-1mm}}c@{}}
   192 \begin{tikzpicture}
   192 \begin{tikzpicture}
   193 \begin{axis}[
   193 \begin{axis}[
   421 \texttt{[...]+} \;\cdot\; \texttt{.} \;\cdot\; 
   421 \texttt{[...]+} \;\cdot\; \texttt{.} \;\cdot\; 
   422 \texttt{[...]\{2,6\}}
   422 \texttt{[...]\{2,6\}}
   423 \]
   423 \]
   424 
   424 
   425 \noindent
   425 \noindent
   426 which is much less readable than \eqref{email}. Similarly for
   426 which is much less readable than the regular expression in
   427 the regular expression that matches the string $hello$ we 
   427 \eqref{email}. Similarly for the regular expression that matches the
   428 should write
   428 string $hello$ we should write
   429 
   429 
   430 \[
   430 \[
   431 h \cdot e \cdot l \cdot l \cdot o
   431 h \cdot e \cdot l \cdot l \cdot o
   432 \]
   432 \]
   433 
   433 
   458 issue, then the latter can be seen as syntactic sugar of
   458 issue, then the latter can be seen as syntactic sugar of
   459 the former. For example we could replace
   459 the former. For example we could replace
   460 
   460 
   461 \begin{center}
   461 \begin{center}
   462 \begin{tabular}{rcl}
   462 \begin{tabular}{rcl}
   463 $r+$ & $\mapsto$ & $r\cdot r^*$\\
   463 $r^+$ & $\mapsto$ & $r\cdot r^*$\\
   464 $r?$ & $\mapsto$ & $\ONE + r$\\
   464 $r^?$ & $\mapsto$ & $\ONE + r$\\
   465 $\backslash d$ & $\mapsto$ & $0 + 1 + 2 + \ldots + 9$\\
   465 $\backslash d$ & $\mapsto$ & $0 + 1 + 2 + \ldots + 9$\\
   466 $[\text{\it a - z}]$ & $\mapsto$ & $a + b + \ldots + z$\\
   466 $[\text{\it a - z}]$ & $\mapsto$ & $a + b + \ldots + z$\\
   467 \end{tabular}
   467 \end{tabular}
   468 \end{center}
   468 \end{center}
   469 
   469 
   567 simplify regular expressions, which will mean we can speed
   567 simplify regular expressions, which will mean we can speed
   568 up the calculations. 
   568 up the calculations. 
   569 
   569 
   570 \subsection*{My Fascination for Regular Expressions}
   570 \subsection*{My Fascination for Regular Expressions}
   571 
   571 
   572 Up until a few years ago I was not really interested in
   572 Up until a few years ago I was not really interested in regular
   573 regular expressions. They have been studied for the last 60
   573 expressions. They have been studied for the last 60 years (by smarter
   574 years (by smarter people than me)---surely nothing new can be
   574 people than me)---surely nothing new can be found out about them. I
   575 found out about them. I even have the vague recollection that
   575 even have the vague recollection that I did not quite understand them
   576 I did not quite understand them during my undergraduate study. If I remember
   576 during my undergraduate study. If I remember correctly,\footnote{That
   577 correctly,\footnote{That was really a long time ago.} I got
   577   was really a long time ago.} I got utterly confused about $\ONE$
   578 utterly confused about $\ONE$ (which my lecturer wrote as
   578 (which my lecturer wrote as $\epsilon$) and the empty string (which he
   579 $\epsilon$) and the empty string.\footnote{Obviously the
   579 also wrote as $\epsilon$).\footnote{Obviously the lecturer must have
   580 lecturer must have been bad.} Since my then, I have used
   580   been bad ;o)} Since my then, I have used regular expressions for
   581 regular expressions for implementing lexers and parsers as I
   581 implementing lexers and parsers as I have always been interested in
   582 have always been interested in all kinds of programming
   582 all kinds of programming languages and compilers, which invariably
   583 languages and compilers, which invariably need regular
   583 need regular expressions in some form or shape.
   584 expressions in some form or shape. 
       
   585 
   584 
   586 To understand my fascination \emph{nowadays} with regular
   585 To understand my fascination \emph{nowadays} with regular
   587 expressions, you need to know that my main scientific interest
   586 expressions, you need to know that my main scientific interest
   588 for the last 14 years has been with theorem provers. I am a
   587 for the last 17 years has been with theorem provers. I am a
   589 core developer of one of
   588 core developer of one of
   590 them.\footnote{\url{http://isabelle.in.tum.de}} Theorem
   589 them.\footnote{\url{http://isabelle.in.tum.de}} Theorem
   591 provers are systems in which you can formally reason about
   590 provers are systems in which you can formally reason about
   592 mathematical concepts, but also about programs. In this way
   591 mathematical concepts, but also about programs. In this way
   593 theorem prover can help with the manacing problem of writing bug-free code. Theorem provers have
   592 theorem provers can help with the manacing problem of writing bug-free code. Theorem provers have
   594 proved already their value in a number of cases (even in
   593 proved already their value in a number of cases (even in
   595 terms of hard cash), but they are still clunky and difficult
   594 terms of hard cash), but they are still clunky and difficult
   596 to use by average programmers.
   595 to use by average programmers.
   597 
   596 
   598 Anyway, in about 2011 I came across the notion of
   597 Anyway, in about 2011 I came across the notion of \defn{derivatives of
   599 \defn{derivatives of regular expressions}. This notion allows
   598   regular expressions}. This notion allows one to do almost all
   600 one to do almost all calculations in regular language theory
   599 calculations with regular expressions on the level of regular
   601 on the level of regular expressions, not needing any automata (you will
   600 expressions, not needing any automata (you will see we only touch
   602 see we only touch briefly on automata in lecture 3).
   601 briefly on automata in lecture 3). Automata are usually the main
   603 This is crucial because automata are graphs and it is rather
   602 object of study in formal language courses.  The avoidance of automata
   604 difficult to reason about graphs in theorem provers. In
   603 is crucial because automata are graphs and it is rather difficult to
   605 contrast, reasoning about regular expressions is easy-peasy in
   604 reason about graphs in theorem provers. In contrast, reasoning about
   606 theorem provers. Is this important? I think yes, because
   605 regular expressions is easy-peasy in theorem provers. Is this
   607 according to Kuklewicz nearly all POSIX-based regular
   606 important? I think yes, because according to Kuklewicz nearly all
   608 expression matchers are
   607 POSIX-based regular expression matchers are
   609 buggy.\footnote{\url{http://www.haskell.org/haskellwiki/Regex_Posix}}
   608 buggy.\footnote{\url{http://www.haskell.org/haskellwiki/Regex_Posix}}
   610 With my PhD student Fahad Ausaf I proved
   609 With my PhD student Fahad Ausaf I proved the correctness for one such
   611 the correctness for one such matcher that was
   610 matcher that was proposed by Sulzmann and Lu in
   612 proposed by Sulzmann and Lu in
   611 2014.\footnote{\url{http://goo.gl/bz0eHp}} Hopefully we can prove that
   613 2014.\footnote{\url{http://goo.gl/bz0eHp}} Hopefully we can
   612 the machine code(!)  that implements this code efficiently is correct
   614 prove that the machine code(!)
   613 also. Writing programs in this way does not leave any room for
   615 that implements this code efficiently is correct also. Writing
   614 potential errors or bugs. How nice!
   616 programs in this way does not leave any room for potential
       
   617 errors or bugs. How nice!
       
   618 
   615 
   619 What also helped with my fascination with regular expressions
   616 What also helped with my fascination with regular expressions
   620 is that we could indeed find out new things about them that
   617 is that we could indeed find out new things about them that
   621 have surprised some experts. Together with two colleagues from China, I was
   618 have surprised some experts. Together with two colleagues from China, I was
   622 able to prove the Myhill-Nerode theorem by only using regular
   619 able to prove the Myhill-Nerode theorem by only using regular
   627 blog\footnote{\url{http://goo.gl/2R11Fw}} assumed can only be
   624 blog\footnote{\url{http://goo.gl/2R11Fw}} assumed can only be
   628 shown via automata. Even sombody who has written a 700+-page
   625 shown via automata. Even sombody who has written a 700+-page
   629 book\footnote{\url{http://goo.gl/fD0eHx}} on regular
   626 book\footnote{\url{http://goo.gl/fD0eHx}} on regular
   630 exprssions did not know better. Well, we showed it can also be
   627 exprssions did not know better. Well, we showed it can also be
   631 done with regular expressions only.\footnote{\url{http://www.inf.kcl.ac.uk/staff/urbanc/Publications/rexp.pdf}} 
   628 done with regular expressions only.\footnote{\url{http://www.inf.kcl.ac.uk/staff/urbanc/Publications/rexp.pdf}} 
   632 What a feeling if you are an outsider to the subject!
   629 What a feeling when you are an outsider to the subject!
   633 
   630 
   634 To conclude: Despite my early ignorance about regular
   631 To conclude: Despite my early ignorance about regular expressions, I
   635 expressions, I find them now very interesting. They have a
   632 find them now very interesting. They have a beautiful mathematical
   636 beautiful mathematical theory behind them, which can be
   633 theory behind them, which can be sometimes quite deep and which
   637 sometimes quite deep and which sometimes contains hidden snares. They have
   634 sometimes contains hidden snares. They have practical importance
   638 practical importance (remember the shocking runtime of the
   635 (remember the shocking runtime of the regular expression matchers in
   639 regular expression matchers in Python, Ruby and Java in some
   636 Python, Ruby and Java in some instances and the problems in Stack
   640 instances and the problems in Stack Exchange and the Atom editor). 
   637 Exchange and the Atom editor).  People who are not very familiar with
   641 People who are not very familiar with the
   638 the mathematical background of regular expressions get them
   642 mathematical background of regular expressions get them
   639 consistently wrong (surprising given they are a supposed to be core
   643 consistently wrong. The hope is that we can do better in the
   640 skill for computer scientists). The hope is that we can do better in
   644 future---for example by proving that the algorithms actually
   641 the future---for example by proving that the algorithms actually
   645 satisfy their specification and that the corresponding
   642 satisfy their specification and that the corresponding implementations
   646 implementations do not contain any bugs. We are close, but not
   643 do not contain any bugs. We are close, but not yet quite there.
   647 yet quite there.
       
   648 
   644 
   649 Notwithstanding my fascination, I am also happy to admit that regular
   645 Notwithstanding my fascination, I am also happy to admit that regular
   650 expressions have their shortcomings. There are some well-known
   646 expressions have their shortcomings. There are some well-known
   651 ``theoretical'' shortcomings, for example recognising strings
   647 ``theoretical'' shortcomings, for example recognising strings of the
   652 of the form $a^{n}b^{n}$. I am not so bothered by them. What I
   648 form $a^{n}b^{n}$ is not possible with regular expressions. This means
   653 am bothered about is when regular expressions are in the way
   649 for example if we try to regognise whether parentheses are well-nested
   654 of practical programming. For example, it turns out that the
   650 is impossible with (basic) regular expressions.  I am not so bothered
   655 regular expression for email addresses shown in \eqref{email}
   651 by these shortcomings. What I am bothered about is when regular
   656 is hopelessly inadequate for recognising all of them (despite
   652 expressions are in the way of practical programming. For example, it
   657 being touted as something every computer scientist should know
   653 turns out that the regular expression for email addresses shown in
   658 about). The W3 Consortium (which standardises the Web)
   654 \eqref{email} is hopelessly inadequate for recognising all of them
   659 proposes to use the following, already more complicated
   655 (despite being touted as something every computer scientist should
   660 regular expressions for email addresses:
   656 know about). The W3 Consortium (which standardises the Web) proposes
       
   657 to use the following, already more complicated regular expressions for
       
   658 email addresses:
   661 
   659 
   662 {\small\begin{lstlisting}[language={},keywordstyle=\color{black},numbers=none]
   660 {\small\begin{lstlisting}[language={},keywordstyle=\color{black},numbers=none]
   663 [a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*
   661 [a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*
   664 \end{lstlisting}}
   662 \end{lstlisting}}
   665 
   663 
   666 \noindent But they admit that by using this regular expression
   664 \noindent But they admit that by using this regular expression
   667 they wilfully violate the RFC 5322 standard which specifies
   665 they wilfully violate the RFC 5322 standard, which specifies
   668 the syntax of email addresses. With their proposed regular
   666 the syntax of email addresses. With their proposed regular
   669 expression they are too strict in some cases and too lax in
   667 expression they are too strict in some cases and too lax in
   670 others. Not a good situation to be in. A regular expression
   668 others. Not a good situation to be in. A regular expression
   671 that is claimed to be closer to the standard is shown in
   669 that is claimed to be closer to the standard is shown in
   672 Figure~\ref{monster}. Whether this claim is true or not, I
   670 Figure~\ref{monster}. Whether this claim is true or not, I
   681 
   679 
   682 
   680 
   683 \noindent which explains some of the crazier parts of email
   681 \noindent which explains some of the crazier parts of email
   684 addresses. Still it is good to know that some tasks in text
   682 addresses. Still it is good to know that some tasks in text
   685 processing just cannot be achieved by using regular
   683 processing just cannot be achieved by using regular
   686 expressions.
   684 expressions. But for what we want to use them (lexing) they are
       
   685 pretty good.
   687 
   686 
   688 
   687 
   689 \begin{figure}[p]
   688 \begin{figure}[p]
   690 \lstinputlisting{../progs/crawler1.scala}
   689 \lstinputlisting{../progs/crawler1.scala}
   691 
   690