handouts/ho01.tex
changeset 550 71fc4a7a7039
parent 507 fdbc7d0ec04f
child 551 bd551ede2be6
% compiler explorer
% https://gcc.godbolt.org

%https://www.youtube.com/watch?v=gmhMQfFQu20
\begin{document}
\fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017, 2018}

\section*{Handout 1}

This module is about text processing, be it for web-crawlers,
compilers, dictionaries, DNA-data, ad filters and so on.  When looking
for a particular string, like $abc$, in a large text we can use the
Knuth-Morris-Pratt algorithm, which is currently the most efficient
general string search algorithm. But often we do \emph{not} just look
for a particular string, but for string patterns. For example, in
program code we need to identify which parts are keywords (\texttt{if}, \texttt{then},
\texttt{while}, \texttt{for}, etc) and which are identifiers (variable names). A pattern for
identifiers could be stated as: they start with a letter, followed by
zero or more letters, numbers and underscores.  Often we also face the
problem that we are given a string (for example some user input) and
want to know whether it matches a particular pattern---be it an email
address, for example. In this way we can exclude user input that would
\end{lstlisting}

\noindent according to the regular expression we specified in line
\eqref{email} above. Whether this is intended or not is a different
question (the second email above is actually an acceptable email
address according to the RFC 5322 standard for email addresses).

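For concreteness, such a check can be sketched in Scala as a small script; note that the character classes below are simplified stand-ins of my own choosing, not the exact classes of the pattern in \eqref{email}:

```scala
// A simplified stand-in for an email pattern: local part, @,
// domain, a dot, and a 2-6 character top-level part. The classes
// are approximations, not the ones from the handout's pattern.
val emailPattern = "[a-z0-9_.-]+@[a-z0-9.-]+\\.[a-z.]{2,6}"

// String.matches checks the *whole* string against the pattern
def isEmail(s: String): Boolean = s.matches(emailPattern)

println(isEmail("christian.urban@kcl.ac.uk"))                    // true
println(isEmail("disposable.style.email.with+symbol@example.com")) // false
```

The second address is rejected only because \texttt{+} is not in the local-part class, which illustrates how such ad-hoc patterns can disagree with RFC 5322.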
As mentioned above, identifiers, or variables, in program code
are often required to satisfy the constraint that they start
with a letter and then can be followed by zero or more letters
or numbers and also can include underscores, but not as the
\pcode{re?} &	 matches 0 or 1 occurrences of the preceding 
expression\\
\pcode{re\{n\}}	& matches exactly \pcode{n} occurrences of 
the preceding expression\\
\pcode{re\{n,m\}} & matches at least \pcode{n} and at most \pcode{m}
occurrences of the preceding expression\\
\pcode{[...]} & matches any single character inside the 
brackets\\
\pcode{[^...]} & matches any single character not inside the 
brackets\\
\pcode{...-...} & character ranges\\
much ubiquitous in computer science. There are many libraries
implementing regular expressions. I am sure you have come across them
before (remember the PRA module?). Why on earth then is there any
interest in studying them again in depth in this module? Well, one
answer is in the following two graphs about regular expression
matching in Python, Ruby and Java (Version 8).

\begin{center}
\begin{tabular}{@{\hspace{-1mm}}c@{\hspace{1mm}}c@{}}
\begin{tikzpicture}
\begin{axis}[
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5.5cm,
    height=4.5cm, 
    legend entries={Python, Java 8},  
    legend pos=north west,
    legend cell align=left]
\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
\end{axis}
\begin{center}
\url{https://vimeo.com/112065252}
\end{center}

\noindent
A similar problem also occurred in the Atom editor:

\begin{center}
\url{http://davidvgalbraith.com/how-i-fixed-atom/}
\end{center}

\noindent
Such troublesome regular expressions are sometimes called \emph{evil
  regular expressions} because they have the potential to make regular
expression matching engines topple over, like in Python, Ruby and
Java. This ``toppling over'' is also sometimes called
\emph{catastrophic backtracking}.  I have also seen the term
\emph{eternal matching} used for this.  The problem with evil regular
expressions is that they can have some serious consequences, for
example, if you use them in your web-application. The reason is that
hackers can look for these instances where the matching engine behaves
badly and then mount a nice DoS-attack against your application. These
attacks already have their own name: \emph{Regular Expression
  Denial of Service Attacks (ReDoS)}.

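To see the effect yourself, here is a small Scala script (Scala uses Java's regex engine underneath). The pattern \texttt{(a*)*b} is a standard example of an evil expression, my choice here rather than necessarily the one behind the graphs above:

```scala
import java.util.regex.Pattern

// An "evil" regular expression: the nested star lets a backtracking
// matcher try exponentially many ways of splitting up the a's.
val evil = Pattern.compile("(a*)*b")

def evilMatch(s: String): Boolean = evil.matcher(s).matches()

// helper: wall-clock time of a computation in seconds
def time[A](body: => A): (A, Double) = {
  val start = System.nanoTime()
  val result = body
  (result, (System.nanoTime() - start) / 1e9)
}

// Strings consisting only of a's can never match (there is no b),
// but the matcher discovers this only after exhausting every split.
for (n <- List(5, 10, 15, 20)) {
  val (res, secs) = time(evilMatch("a" * n))
  println(f"n = $n%2d   result = $res   time = $secs%.4f s")
}
```

Pushing $n$ much beyond 25 makes the running time blow up visibly, since the number of explored splits grows exponentially in $n$.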
It will be instructive to look behind the ``scenes'' to find
out why Python and Ruby (and others) behave so badly when
matching strings with evil regular expressions. But we will also look
at a relatively simple algorithm that solves this problem much
structure of regular expressions is always clear. But for
writing them down in a more mathematical fashion, parentheses
will be helpful. For example we will write $(r_1 + r_2)^*$,
which is different from, say, $r_1 + (r_2)^*$. The former means
roughly zero or more times $r_1$ or $r_2$, while the latter
means $r_1$, or zero or more times $r_2$. This will turn out to
be two different patterns, which in general match different
strings. We should also write $(r_1 + r_2) + r_3$, which is
different from the regular expression $r_1 + (r_2 + r_3)$, but
in the case of $+$ and $\cdot$ we actually do not care about the
order and just write $r_1 + r_2 + r_3$, or $r_1 \cdot r_2
\cdot r_3$, respectively. The reasons for this carelessness will become
clear shortly. 

In the literature you will often find that the choice $r_1 +
r_2$ is written as $r_1\mid{}r_2$ or $r_1\mid\mid{}r_2$. Also,
often our $\ZERO$ and $\ONE$ are written $\varnothing$ and
$\epsilon$, respectively. Following the convention in the
literature, we will often omit the $\cdot$. This
is to make some concrete regular expressions more readable.
For example the regular expression for email addresses shown
in \eqref{email} would, fully expanded, look like

\[
\texttt{[...]+} \;\cdot\;  \texttt{@} \;\cdot\; 
\texttt{[...]+} \;\cdot\; \texttt{.} \;\cdot\; 
\texttt{[...]\{2,6\}}
\[
r_1 \equiv r_2 \;\dn\; L(r_1) = L(r_2)
\]

\noindent That means two regular expressions are said to be
equivalent if they match the same set of strings. That is,
their meaning is the same. Therefore we
do not really distinguish between the different regular
expressions $(r_1 + r_2) + r_3$ and $r_1 + (r_2 + r_3)$,
because they are equivalent. I leave you with the question
whether

even have the vague recollection that I did not quite understand them
during my undergraduate study. If I remember correctly,\footnote{That
  was really a long time ago.} I got utterly confused about $\ONE$
(which my lecturer wrote as $\epsilon$) and the empty string (which he
also wrote as $\epsilon$).\footnote{Obviously the lecturer must have
  been bad ;o)} Since then, I have used regular expressions for
implementing lexers and parsers as I have always been interested in
all kinds of programming languages and compilers, which invariably
need regular expressions in some form or shape.

To understand my fascination \emph{nowadays} with regular
for the last 17 years has been with theorem provers. I am a
core developer of one of
them.\footnote{\url{http://isabelle.in.tum.de}} Theorem
provers are systems in which you can formally reason about
mathematical concepts, but also about programs. In this way
theorem provers can help with the menacing problem of writing
bug-free code. Theorem provers have already proved their value
in a number of cases (even in terms of hard cash), but they are
still clunky and difficult for average programmers to use.

Anyway, in about 2011 I came across the notion of \defn{derivatives of
  regular expressions}. This notion allows one to do almost all
calculations with regular expressions on the level of regular
expressions, not needing any automata (you will see we only touch
briefly on automata in lecture 3). Automata are usually the main
object of study in formal language courses.  The avoidance of automata
is crucial for me because automata are graphs and it is rather difficult to
reason about graphs in theorem provers. In contrast, reasoning about
regular expressions is easy-peasy in theorem provers. Is this
important? I think yes, because according to Kuklewicz nearly all
POSIX-based regular expression matchers are
buggy.\footnote{\url{http://www.haskell.org/haskellwiki/Regex_Posix}}
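As a taster, the core of a derivative-based matcher can be sketched in a few lines of Scala; the datatype and function names below are my own choices, and later lectures may set things up slightly differently:

```scala
// Regular expressions as a datatype
abstract class Rexp
case object ZERO extends Rexp                    // matches nothing
case object ONE extends Rexp                     // matches the empty string
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp  // r1 + r2
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp  // r1 . r2
case class STAR(r: Rexp) extends Rexp            // r*

// nullable(r): can r match the empty string?
def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHAR(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
}

// der(c, r): what is left of r after matching the character c
def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO => ZERO
  case ONE => ZERO
  case CHAR(d) => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
}

// matching a whole string: take derivatives character by character,
// then test whether the remaining regular expression is nullable
def matches(r: Rexp, s: String): Boolean =
  nullable(s.foldLeft(r)((r, c) => der(c, r)))
```

Notice that everything is a calculation on regular expressions themselves; no automaton is ever constructed, which is exactly what makes this approach pleasant in a theorem prover.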
expressions and the notion of derivatives. Earlier versions of
this theorem always used automata in the proof. Using this
theorem we can show that regular languages are closed under
complementation, something which Gasarch in his
blog\footnote{\url{http://goo.gl/2R11Fw}} assumed can only be
shown via automata. So even somebody who has written a 700+-page
book\footnote{\url{http://goo.gl/fD0eHx}} on regular
expressions did not know better. Well, we showed it can also be
done with regular expressions only.\footnote{\url{http://nms.kcl.ac.uk/christian.urban/Publications/posix.pdf}}
What a feeling when you are an outsider to the subject!

To conclude: Despite my early ignorance about regular expressions, I
find them now extremely interesting. They have practical importance
(remember the shocking runtime of the regular expression matchers in
Python, Ruby and Java in some instances and the problems in Stack
Exchange and the Atom editor). They are used in tools like Snort and
Bro in order to monitor network traffic. They have a beautiful mathematical
theory behind them, which can be sometimes quite deep and which
sometimes contains hidden snares.  People who are not very familiar
with the mathematical background of regular expressions get them
consistently wrong (this is surprising given they are supposed to be a
core skill for computer scientists). The hope is that we can do better
in the future---for example by proving that the algorithms actually
satisfy their specification and that the corresponding implementations
do not contain any bugs. We are close, but not yet quite there.

Notwithstanding my fascination, I am also happy to admit that regular
expressions have their shortcomings. There are some well-known
``theoretical'' shortcomings, for example recognising strings of the
form $a^{n}b^{n}$ is not possible with regular expressions. This means,
for example, that recognising whether parentheses are well-nested
in an expression is impossible with (basic) regular expressions.  I am
not so bothered by these shortcomings. What I am bothered about is
when regular expressions are in the way of practical programming. For
example, it turns out that the regular expression for email addresses
shown in \eqref{email} is hopelessly inadequate for recognising all of
  problems.''
\end{quote}  


\begin{figure}[p]\small
  \lstinputlisting[numbers=left,linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}]
                  {../progs/crawler1.scala}

\caption{The Scala code for a simple web-crawler that checks
for broken links in a web-page. It uses the regular expression
\texttt{http\_pattern} in Line~\ref{httpline} for recognising
\end{figure}



\begin{figure}[p]\small
  \lstinputlisting[numbers=left,linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}]
                  {../progs/crawler2.scala}

\caption{A version of the web-crawler that only follows links
in ``my'' domain---since these are the ones I am interested in
fixing. It uses the regular expression \texttt{my\_urls} in
not.\label{crawler2}}

\end{figure}

\begin{figure}[p]\small
  \lstinputlisting[numbers=left,linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}]
                  {../progs/crawler3.scala}

\caption{A small email harvester---whenever we download a
web-page, we also check whether it contains any email
addresses. For this we use the regular expression