handouts/ho01.tex
changeset 504 5dc452d7c08e
parent 503 f2d7b885b3e3
parent 496 5c9de27a5b30
child 507 fdbc7d0ec04f
equal deleted inserted replaced
503:f2d7b885b3e3 504:5dc452d7c08e
     1 \documentclass{article}
     1 \documentclass{article}
     2 \usepackage{../style}
     2 \usepackage{../style}
     3 \usepackage{../langs}
     3 \usepackage{../langs}
     4 \usepackage{../graphics}
     4 \usepackage{../graphics}
     5 \usepackage{../data}
     5 \usepackage{../data}
       
     6 \usepackage{lstlinebgrd}
       
     7 \definecolor{capri}{rgb}{0.0, 0.75, 1.0}
     6 
     8 
     7 %%http://regexcrossword.com/challenges/cities/puzzles/1
     9 %%http://regexcrossword.com/challenges/cities/puzzles/1
     8 %%https://jex.im/regulex/
    10 %%https://jex.im/regulex/
     9 %%https://www.reddit.com/r/ProgrammingLanguages/comments/42dlem/mona_compiler_development_part_2_parsing/
    11 %%https://www.reddit.com/r/ProgrammingLanguages/comments/42dlem/mona_compiler_development_part_2_parsing/
    10 %%https://www.reddit.com/r/ProgrammingLanguages/comments/43wlkq/formal_grammar_for_csh_tsch_sh_or_bash/
    12 %%https://www.reddit.com/r/ProgrammingLanguages/comments/43wlkq/formal_grammar_for_csh_tsch_sh_or_bash/
    14 %% https://www.debuggex.com
    16 %% https://www.debuggex.com
    15 %% https://jex.im/regulex/#!embed=false&flags=&re=%5E(a%7Cb)*%3F%24
    17 %% https://jex.im/regulex/#!embed=false&flags=&re=%5E(a%7Cb)*%3F%24
    16 
    18 
    17 %% email validator
    19 %% email validator
    18 %% http://www.ex-parrot.com/%7Epdw/Mail-RFC822-Address.html
    20 %% http://www.ex-parrot.com/%7Epdw/Mail-RFC822-Address.html
       
    21 % https://jackfoxy.github.io/FsRegEx/emailregex.html
       
    22 
    19 
    23 
    20 %% regex testers
    24 %% regex testers
    21 % https://regex101.com
    25 % https://regex101.com
    22 % http://regexr.com
    26 % http://regexr.com
    23 
    27 
    35 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017}
    39 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017}
    36 
    40 
    37 \section*{Handout 1}
    41 \section*{Handout 1}
    38 
    42 
    39 This module is about text processing, be it for web-crawlers,
    43 This module is about text processing, be it for web-crawlers,
    40 compilers, dictionaries, DNA-data and so on. When looking for a
    44 compilers, dictionaries, DNA-data, ad filters and so on.  When looking for a
    41 particular string, like $abc$ in a large text we can use the
    45 particular string, like $abc$ in a large text we can use the
    42 Knuth-Morris-Pratt algorithm, which is currently the most efficient
    46 Knuth-Morris-Pratt algorithm, which is currently the most efficient
    43 general string search algorithm. But often we do \emph{not} just look
    47 general string search algorithm. But often we do \emph{not} just look
    44 for a particular string, but for string patterns. For example in
    48 for a particular string, but for string patterns. For example, in
    45 program code we need to identify what are the keywords (if, then,
    49 program code we need to identify what are the keywords (if, then,
    46 while, etc), what are the identifiers (variable names). A pattern for
    50 while, for, etc), what are the identifiers (variable names). A pattern for
    47 identifiers could be stated as: they start with a letter, followed by
    51 identifiers could be stated as: they start with a letter, followed by
    48 zero or more letters, numbers and underscores.  Also often we face the
    52 zero or more letters, numbers and underscores.  Often we also face the
    49 problem that we are given a string (for example some user input) and
    53 problem that we are given a string (for example some user input) and
    50 want to know whether it matches a particular pattern---be it an email
    54 want to know whether it matches a particular pattern---be it an email
    51 address, for example. In this way we can exclude user input that would
    55 address, for example. In this way we can exclude user input that would
    52 otherwise have nasty effects on our program (crashing it or making it
    56 otherwise have nasty effects on our program (crashing it or making it
    53 go into an infinite loop, if not worse). The point is that the fast
    57 go into an infinite loop, if not worse). In tools like Snort, scanning
    54 Knuth-Morris-Pratt algorithm for strings is not good enough for such
    58 for computer viruses or filtering out spam usually involves scanning
    55 string patterns.\smallskip
    59 for some signature (essentially a string pattern).  The point is that
       
    60 the fast Knuth-Morris-Pratt algorithm for strings is not good enough
       
    61 for such string \emph{patterns}.\smallskip
    56 
    62 
    57 \defn{Regular expressions} help with conveniently specifying
    63 \defn{Regular expressions} help with conveniently specifying
    58 such patterns. The idea behind regular expressions is that
    64 such patterns. The idea behind regular expressions is that
    59 they are a simple method for describing languages (or sets of
    65 they are a simple method for describing languages (or sets of
    60 strings)\ldots at least languages we are interested in in
    66 strings)\ldots at least languages we are interested in in
   186 interest in studying them again in depth in this module? Well, one
   192 interest in studying them again in depth in this module? Well, one
   187 answer is in the following two graphs about regular expression
   193 answer is in the following two graphs about regular expression
   188 matching in Python, Ruby and Java.
   194 matching in Python, Ruby and Java.
   189 
   195 
   190 \begin{center}
   196 \begin{center}
   191 \begin{tabular}{@{\hspace{-1mm}}c@{\hspace{-1mm}}c@{}}
   197 \begin{tabular}{@{\hspace{-1mm}}c@{\hspace{1mm}}c@{}}
       
   198 \begin{tikzpicture}
       
   199 \begin{axis}[
       
   200     title={Graph: $\texttt{(a*)*\,b}$ and strings 
       
   201            $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
       
   202     xlabel={$n$},
       
   203     x label style={at={(1.05,0.0)}},
       
   204     ylabel={time in secs},
       
   205     enlargelimits=false,
       
   206     xtick={0,5,...,30},
       
   207     xmax=33,
       
   208     ymax=35,
       
   209     ytick={0,5,...,30},
       
   210     scaled ticks=false,
       
   211     axis lines=left,
       
   212     width=5.5cm,
       
   213     height=4.5cm, 
       
   214     legend entries={Python, Java},  
       
   215     legend pos=north west,
       
   216     legend cell align=left]
       
   217 \addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
       
   218 \addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
       
   219 \end{axis}
       
   220 \end{tikzpicture}
       
   221 &
   192 \begin{tikzpicture}
   222 \begin{tikzpicture}
   193 \begin{axis}[
   223 \begin{axis}[
   194     title={Graph: $\texttt{a?\{n\}\,a{\{n\}}}$ and strings 
   224     title={Graph: $\texttt{a?\{n\}\,a{\{n\}}}$ and strings 
   195            $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
   225            $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
   196     xlabel={$n$},
   226     xlabel={$n$},
   210     legend cell align=left]
   240     legend cell align=left]
   211 \addplot[blue,mark=*, mark options={fill=white}] table {re-python.data};
   241 \addplot[blue,mark=*, mark options={fill=white}] table {re-python.data};
   212 \addplot[brown,mark=triangle*, mark options={fill=white}] table {re-ruby.data};  
   242 \addplot[brown,mark=triangle*, mark options={fill=white}] table {re-ruby.data};  
   213 \end{axis}
   243 \end{axis}
   214 \end{tikzpicture}
   244 \end{tikzpicture}
   215 &
       
   216 \begin{tikzpicture}
       
   217 \begin{axis}[
       
   218     title={Graph: $\texttt{(a*)*\,b}$ and strings 
       
   219            $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
       
   220     xlabel={$n$},
       
   221     x label style={at={(1.05,0.0)}},
       
   222     ylabel={time in secs},
       
   223     enlargelimits=false,
       
   224     xtick={0,5,...,30},
       
   225     xmax=33,
       
   226     ymax=35,
       
   227     ytick={0,5,...,30},
       
   228     scaled ticks=false,
       
   229     axis lines=left,
       
   230     width=5.5cm,
       
   231     height=4.5cm, 
       
   232     legend entries={Python, Java},  
       
   233     legend pos=north west,
       
   234     legend cell align=left]
       
   235 \addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
       
   236 \addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
       
   237 \end{axis}
       
   238 \end{tikzpicture}
       
   239 \end{tabular}
   245 \end{tabular}
   240 \end{center}
   246 \end{center}
   241 
   247 
   242 \noindent This first graph shows that Python needs approximately 29
   248 \noindent This first graph shows that Python and Java need
   243 seconds for finding out whether a string of 28 \texttt{a}s matches the
   249 approximately 30 seconds to find out that the regular expression
   244 regular expression \texttt{a?\{28\}\,a\{28\}}.  Ruby is even slightly
   250 $\texttt{(a*)*\,b}$ does not match strings of 28 \texttt{a}s.
       
   251 Similarly, the second shows that Python needs approximately 29 seconds
       
   252 for finding out whether a string of 28 \texttt{a}s matches the regular
       
   253 expression \texttt{a?\{28\}\,a\{28\}}.  Ruby is even slightly
   245 worse.\footnote{In this example Ruby uses the slightly different
   254 worse.\footnote{In this example Ruby uses the slightly different
   246   regular expression \texttt{a?a?a?...a?a?aaa...aa}, where the
   255   regular expression \texttt{a?a?a?...a?a?aaa...aa}, where the
   247   \texttt{a?} and \texttt{a} each occur $n$ times. More such test
   256   \texttt{a?} and \texttt{a} each occur $n$ times. More such test
   248   cases can be found at \url{http://www.computerbytesman.com/redos/}.}
   257   cases can be found at \url{http://www.computerbytesman.com/redos/}.}
   249 Simlarly, Python and Java needs approximately 30 seconds to find out
   258 Admittedly, these regular expressions are carefully chosen to exhibit
   250 that the regular expression $\texttt{(a*)*\,b}$ does not match strings
   259 this exponential behaviour, but similar ones occur more often than one
   251 of 28 \texttt{a}s.  Admittedly, these regular expressions are
   260 wants in ``real life''. For example, on 20 July 2016 a similar regular
   252 carefully chosen to exhibit this exponential behaviour, but similar
   261 expression brought the webpage \href{http://stackexchange.com}{Stack
   253 ones occur more often than one wants in ``real life''. For example, on
   262   Exchange} to its knees:
   254 20 July 2016 a similar regular expression brought the webpage
       
   255 \href{http://stackexchange.com}{Stack Exchange} to its knees:
       
   256 
   263 
   257 \begin{center}
   264 \begin{center}
   258 \url{http://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}
   265 \url{http://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}
   259 \end{center}
   266 \end{center}
   260 
   267 
   289 out why Python and Ruby (and others) behave so badly when
   296 out why Python and Ruby (and others) behave so badly when
   290 matching strings with evil regular expressions. But we will also look
   297 matching strings with evil regular expressions. But we will also look
   291 at a relatively simple algorithm that solves this problem much
   298 at a relatively simple algorithm that solves this problem much
   292 better than Python and Ruby do\ldots actually it will be two
   299 better than Python and Ruby do\ldots actually it will be two
   293 versions of the algorithm: the first one will be able to
   300 versions of the algorithm: the first one will be able to
   294 process strings of approximately 1,000 \texttt{a}s in 30
   301 process strings of approximately 1,100 \texttt{a}s in 23 
   295 seconds, while the second version will even be able to process
   302 seconds, while the second version will even be able to process
   296 up to 7,000(!) in 30 seconds, see the graph below:
   303 up to 11,000(!) in 5 seconds, see the graph below:
   297 
   304 
   298 \begin{center}
   305 \begin{center}
   299 \begin{tikzpicture}
   306 \begin{tikzpicture}
   300   \begin{axis}[
   307   \begin{axis}[
   301     title={Graph: $\texttt{a?\{n\}\,a{\{n\}}}$ and strings 
   308     title={Graph: $\texttt{a?\{n\}\,a{\{n\}}}$ and strings 
   302            $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
   309            $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
   303     xlabel={$n$},
   310     xlabel={$n$},
   304     x label style={at={(1.05,0.0)}},
   311     x label style={at={(1.05,0.0)}},
   305     ylabel={time in secs},
   312     ylabel={time in secs},
   306     enlargelimits=false,
   313     enlargelimits=false,
   307     xtick={0,3000,...,9000},
   314     xtick={0,3000,...,12000},
   308     xmax=10000,
   315     xmax=13000,
   309     ymax=32,
   316     ymax=32,
   310     ytick={0,5,...,30},
   317     ytick={0,5,...,30},
   311     scaled ticks=false,
   318     scaled ticks=false,
   312     axis lines=left,
   319     axis lines=left,
   313     width=7cm,
   320     width=7cm,
   529 
   536 
   530 The point of the definition of $L$ is that we can use it to
   537 The point of the definition of $L$ is that we can use it to
   531 precisely specify when a string $s$ is matched by a regular
   538 precisely specify when a string $s$ is matched by a regular
   532 expression $r$, namely if and only if $s \in L(r)$. In fact we
   539 expression $r$, namely if and only if $s \in L(r)$. In fact we
   533 will write a program \pcode{match} that takes any string $s$
   540 will write a program \pcode{match} that takes any string $s$
   534 and any regular expression $r$ as argument and returns
   541 and any regular expression $r$ as arguments and returns
   535 \emph{yes}, if $s \in L(r)$ and \emph{no}, if $s \not\in
   542 \emph{yes}, if $s \in L(r)$ and \emph{no}, if $s \not\in
   536 L(r)$. We leave this for the next lecture.
   543 L(r)$. We leave this for the next lecture.
   537 
   544 
   538 There is one more feature of regular expressions that is worth
   545 There is one more feature of regular expressions that is worth
   539 mentioning. Given some strings, there are in general many
   546 mentioning. Given some strings, there are in general many
   634 sometimes contains hidden snares. They have practical importance
   641 sometimes contains hidden snares. They have practical importance
   635 (remember the shocking runtime of the regular expression matchers in
   642 (remember the shocking runtime of the regular expression matchers in
   636 Python, Ruby and Java in some instances and the problems in Stack
   643 Python, Ruby and Java in some instances and the problems in Stack
   637 Exchange and the Atom editor).  People who are not very familiar with
   644 Exchange and the Atom editor).  People who are not very familiar with
   638 the mathematical background of regular expressions get them
   645 the mathematical background of regular expressions get them
   639 consistently wrong (surprising given they are a supposed to be core
   646 consistently wrong (this is surprising given they are a supposed to be
   640 skill for computer scientists). The hope is that we can do better in
   647 core skill for computer scientists). The hope is that we can do better
   641 the future---for example by proving that the algorithms actually
   648 in the future---for example by proving that the algorithms actually
   642 satisfy their specification and that the corresponding implementations
   649 satisfy their specification and that the corresponding implementations
   643 do not contain any bugs. We are close, but not yet quite there.
   650 do not contain any bugs. We are close, but not yet quite there.
   644 
   651 
   645 Notwithstanding my fascination, I am also happy to admit that regular
   652 Notwithstanding my fascination, I am also happy to admit that regular
   646 expressions have their shortcomings. There are some well-known
   653 expressions have their shortcomings. There are some well-known
   647 ``theoretical'' shortcomings, for example recognising strings of the
   654 ``theoretical'' shortcomings, for example recognising strings of the
   648 form $a^{n}b^{n}$ is not possible with regular expressions. This means
   655 form $a^{n}b^{n}$ is not possible with regular expressions. This means
   649 for example if we try to regognise whether parentheses are well-nested
   656 for example if we try to regognise whether parentheses are well-nested
   650 is impossible with (basic) regular expressions.  I am not so bothered
   657 in an expression is impossible with (basic) regular expressions.  I am
   651 by these shortcomings. What I am bothered about is when regular
   658 not so bothered by these shortcomings. What I am bothered about is
   652 expressions are in the way of practical programming. For example, it
   659 when regular expressions are in the way of practical programming. For
   653 turns out that the regular expression for email addresses shown in
   660 example, it turns out that the regular expression for email addresses
   654 \eqref{email} is hopelessly inadequate for recognising all of them
   661 shown in \eqref{email} is hopelessly inadequate for recognising all of
   655 (despite being touted as something every computer scientist should
   662 them (despite being touted as something every computer scientist
   656 know about). The W3 Consortium (which standardises the Web) proposes
   663 should know about). The W3 Consortium (which standardises the Web)
   657 to use the following, already more complicated regular expressions for
   664 proposes to use the following, already more complicated regular
   658 email addresses:
   665 expressions for email addresses:
   659 
   666 
   660 {\small\begin{lstlisting}[language={},keywordstyle=\color{black},numbers=none]
   667 {\small\begin{lstlisting}[language={},keywordstyle=\color{black},numbers=none]
   661 [a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*
   668 [a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*
   662 \end{lstlisting}}
   669 \end{lstlisting}}
   663 
   670 
   680 
   687 
   681 \noindent which explains some of the crazier parts of email
   688 \noindent which explains some of the crazier parts of email
   682 addresses. Still it is good to know that some tasks in text
   689 addresses. Still it is good to know that some tasks in text
   683 processing just cannot be achieved by using regular
   690 processing just cannot be achieved by using regular
   684 expressions. But for what we want to use them (lexing) they are
   691 expressions. But for what we want to use them (lexing) they are
   685 pretty good.
   692 pretty good.\medskip
   686 
   693 
   687 
   694 \noindent
   688 \begin{figure}[p]
   695 Finally there is a joke about regular expressions:
   689 \lstinputlisting{../progs/crawler1.scala}
   696 
       
   697 \begin{quote}\it
       
   698   ``Sometimes you have a programming problem and it seems like the
       
   699   best solution is to use regular expressions; now you have two
       
   700   problems.''
       
   701 \end{quote}  
       
   702 
       
   703 
       
   704 \begin{figure}[p]\small
       
   705   \lstinputlisting[linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}]
       
   706                   {../progs/crawler1.scala}
   690 
   707 
   691 \caption{The Scala code for a simple web-crawler that checks
   708 \caption{The Scala code for a simple web-crawler that checks
   692 for broken links in a web-page. It uses the regular expression
   709 for broken links in a web-page. It uses the regular expression
   693 \texttt{http\_pattern} in Line~\ref{httpline} for recognising
   710 \texttt{http\_pattern} in Line~\ref{httpline} for recognising
   694 URL-addresses. It finds all links using the library function
   711 URL-addresses. It finds all links using the library function
   695 \texttt{findAllIn} in Line~\ref{findallline}.\label{crawler1}}
   712 \texttt{findAllIn} in Line~\ref{findallline}.\label{crawler1}}
   696 
   713 
   697 \end{figure}
   714 \end{figure}
   698 
   715 
   699 
   716 
   700 \begin{figure}[p]
   717 
   701 \lstinputlisting{../progs/crawler2.scala}
   718 \begin{figure}[p]\small
       
   719   \lstinputlisting[linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}]
       
   720                   {../progs/crawler2.scala}
   702 
   721 
   703 \caption{A version of the web-crawler that only follows links
   722 \caption{A version of the web-crawler that only follows links
   704 in ``my'' domain---since these are the ones I am interested in
   723 in ``my'' domain---since these are the ones I am interested in
   705 to fix. It uses the regular expression \texttt{my\_urls} in
   724 to fix. It uses the regular expression \texttt{my\_urls} in
   706 Line~\ref{myurlline} to check for my name in the links. The
   725 Line~\ref{myurlline} to check for my name in the links. The
   709 is a test whether URL is in ``my'' domain or
   728 is a test whether URL is in ``my'' domain or
   710 not.\label{crawler2}}
   729 not.\label{crawler2}}
   711 
   730 
   712 \end{figure}
   731 \end{figure}
   713 
   732 
   714 \begin{figure}[p]
   733 \begin{figure}[p]\small
   715 \lstinputlisting{../progs/crawler3.scala}
   734   \lstinputlisting[linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}]
       
   735                   {../progs/crawler3.scala}
   716 
   736 
   717 \caption{A small email harvester---whenever we download a
   737 \caption{A small email harvester---whenever we download a
   718 web-page, we also check whether it contains any email
   738 web-page, we also check whether it contains any email
   719 addresses. For this we use the regular expression
   739 addresses. For this we use the regular expression
   720 \texttt{email\_pattern} in Line~\ref{emailline}. The main
   740 \texttt{email\_pattern} in Line~\ref{emailline}. The main