handouts/ho01.tex
changeset 477 b78664a24f5d
parent 473 dc528091eb70
child 492 39b7ff2cf1bc
equal deleted inserted replaced
476:d922cc83b70c 477:b78664a24f5d
     1 \documentclass{article}
     1 \documentclass{article}
     2 \usepackage{../style}
     2 \usepackage{../style}
     3 \usepackage{../langs}
     3 \usepackage{../langs}
     4 \usepackage{../graphics}
     4 \usepackage{../graphics}
     5 \usepackage{../data}
     5 \usepackage{../data}
       
     6 \usepackage{lstlinebgrd}
       
     7 \definecolor{capri}{rgb}{0.0, 0.75, 1.0}
     6 
     8 
     7 %%http://regexcrossword.com/challenges/cities/puzzles/1
     9 %%http://regexcrossword.com/challenges/cities/puzzles/1
     8 %%https://jex.im/regulex/
    10 %%https://jex.im/regulex/
     9 %%https://www.reddit.com/r/ProgrammingLanguages/comments/42dlem/mona_compiler_development_part_2_parsing/
    11 %%https://www.reddit.com/r/ProgrammingLanguages/comments/42dlem/mona_compiler_development_part_2_parsing/
    10 %%https://www.reddit.com/r/ProgrammingLanguages/comments/43wlkq/formal_grammar_for_csh_tsch_sh_or_bash/
    12 %%https://www.reddit.com/r/ProgrammingLanguages/comments/43wlkq/formal_grammar_for_csh_tsch_sh_or_bash/
    35 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017}
    37 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017}
    36 
    38 
    37 \section*{Handout 1}
    39 \section*{Handout 1}
    38 
    40 
    39 This module is about text processing, be it for web-crawlers,
    41 This module is about text processing, be it for web-crawlers,
    40 compilers, dictionaries, DNA-data and so on. When looking for a
    42 compilers, dictionaries, DNA-data and so on.  When looking for a
    41 particular string, like $abc$ in a large text we can use the
    43 particular string, like $abc$ in a large text we can use the
    42 Knuth-Morris-Pratt algorithm, which is currently the most efficient
    44 Knuth-Morris-Pratt algorithm, which is currently the most efficient
    43 general string search algorithm. But often we do \emph{not} just look
    45 general string search algorithm. But often we do \emph{not} just look
    44 for a particular string, but for string patterns. For example in
    46 for a particular string, but for string patterns. For example, in
    45 program code we need to identify what are the keywords (if, then,
    47 program code we need to identify what are the keywords (if, then,
    46 while, etc), what are the identifiers (variable names). A pattern for
    48 while, etc), what are the identifiers (variable names). A pattern for
    47 identifiers could be stated as: they start with a letter, followed by
    49 identifiers could be stated as: they start with a letter, followed by
    48 zero or more letters, numbers and underscores.  Also often we face the
    50 zero or more letters, numbers and underscores.  Often we also face the
    49 problem that we are given a string (for example some user input) and
    51 problem that we are given a string (for example some user input) and
    50 want to know whether it matches a particular pattern---be it an email
    52 want to know whether it matches a particular pattern---be it an email
    51 address, for example. In this way we can exclude user input that would
    53 address, for example. In this way we can exclude user input that would
    52 otherwise have nasty effects on our program (crashing it or making it
    54 otherwise have nasty effects on our program (crashing it or making it
    53 go into an infinite loop, if not worse). Scanning for computer viruses
    55 go into an infinite loop, if not worse). In tools like Snort, scanning
    54 or filtering out spam usually involves scanning for some signature
    56 for computer viruses or filtering out spam usually involves scanning
    55 (essentially a pattern).  The point is that the fast
    57 for some signature (essentially a string pattern).  The point is that
    56 Knuth-Morris-Pratt algorithm for strings is not good enough for such
    58 the fast Knuth-Morris-Pratt algorithm for strings is not good enough
    57 string \emph{patterns}.\smallskip
    59 for such string \emph{patterns}.\smallskip
    58 
    60 
    59 \defn{Regular expressions} help with conveniently specifying
    61 \defn{Regular expressions} help with conveniently specifying
    60 such patterns. The idea behind regular expressions is that
    62 such patterns. The idea behind regular expressions is that
    61 they are a simple method for describing languages (or sets of
    63 they are a simple method for describing languages (or sets of
    62 strings)\ldots at least languages we are interested in in
    64 strings)\ldots at least languages we are interested in in
   188 interest in studying them again in depth in this module? Well, one
   190 interest in studying them again in depth in this module? Well, one
   189 answer is in the following two graphs about regular expression
   191 answer is in the following two graphs about regular expression
   190 matching in Python, Ruby and Java.
   192 matching in Python, Ruby and Java.
   191 
   193 
   192 \begin{center}
   194 \begin{center}
   193 \begin{tabular}{@{\hspace{-1mm}}c@{\hspace{-1mm}}c@{}}
   195 \begin{tabular}{@{\hspace{-1mm}}c@{\hspace{1mm}}c@{}}
       
   196 \begin{tikzpicture}
       
   197 \begin{axis}[
       
   198     title={Graph: $\texttt{(a*)*\,b}$ and strings 
       
   199            $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
       
   200     xlabel={$n$},
       
   201     x label style={at={(1.05,0.0)}},
       
   202     ylabel={time in secs},
       
   203     enlargelimits=false,
       
   204     xtick={0,5,...,30},
       
   205     xmax=33,
       
   206     ymax=35,
       
   207     ytick={0,5,...,30},
       
   208     scaled ticks=false,
       
   209     axis lines=left,
       
   210     width=5.5cm,
       
   211     height=4.5cm, 
       
   212     legend entries={Python, Java},  
       
   213     legend pos=north west,
       
   214     legend cell align=left]
       
   215 \addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
       
   216 \addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
       
   217 \end{axis}
       
   218 \end{tikzpicture}
       
   219 &
   194 \begin{tikzpicture}
   220 \begin{tikzpicture}
   195 \begin{axis}[
   221 \begin{axis}[
   196     title={Graph: $\texttt{a?\{n\}\,a{\{n\}}}$ and strings 
   222     title={Graph: $\texttt{a?\{n\}\,a{\{n\}}}$ and strings 
   197            $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
   223            $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
   198     xlabel={$n$},
   224     xlabel={$n$},
   212     legend cell align=left]
   238     legend cell align=left]
   213 \addplot[blue,mark=*, mark options={fill=white}] table {re-python.data};
   239 \addplot[blue,mark=*, mark options={fill=white}] table {re-python.data};
   214 \addplot[brown,mark=triangle*, mark options={fill=white}] table {re-ruby.data};  
   240 \addplot[brown,mark=triangle*, mark options={fill=white}] table {re-ruby.data};  
   215 \end{axis}
   241 \end{axis}
   216 \end{tikzpicture}
   242 \end{tikzpicture}
   217 &
       
   218 \begin{tikzpicture}
       
   219 \begin{axis}[
       
   220     title={Graph: $\texttt{(a*)*\,b}$ and strings 
       
   221            $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
       
   222     xlabel={$n$},
       
   223     x label style={at={(1.05,0.0)}},
       
   224     ylabel={time in secs},
       
   225     enlargelimits=false,
       
   226     xtick={0,5,...,30},
       
   227     xmax=33,
       
   228     ymax=35,
       
   229     ytick={0,5,...,30},
       
   230     scaled ticks=false,
       
   231     axis lines=left,
       
   232     width=5.5cm,
       
   233     height=4.5cm, 
       
   234     legend entries={Python, Java},  
       
   235     legend pos=north west,
       
   236     legend cell align=left]
       
   237 \addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
       
   238 \addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
       
   239 \end{axis}
       
   240 \end{tikzpicture}
       
   241 \end{tabular}
   243 \end{tabular}
   242 \end{center}
   244 \end{center}
   243 
   245 
   244 \noindent This first graph shows that Python needs approximately 29
   246 \noindent This first graph shows that Python and Java need
   245 seconds for finding out whether a string of 28 \texttt{a}s matches the
   247 approximately 30 seconds to find out that the regular expression
   246 regular expression \texttt{a?\{28\}\,a\{28\}}.  Ruby is even slightly
   248 $\texttt{(a*)*\,b}$ does not match strings of 28 \texttt{a}s.
       
   249 Similarly, the second shows that Python needs approximately 29 seconds
       
   250 for finding out whether a string of 28 \texttt{a}s matches the regular
       
   251 expression \texttt{a?\{28\}\,a\{28\}}.  Ruby is even slightly
   247 worse.\footnote{In this example Ruby uses the slightly different
   252 worse.\footnote{In this example Ruby uses the slightly different
   248   regular expression \texttt{a?a?a?...a?a?aaa...aa}, where the
   253   regular expression \texttt{a?a?a?...a?a?aaa...aa}, where the
   249   \texttt{a?} and \texttt{a} each occur $n$ times. More such test
   254   \texttt{a?} and \texttt{a} each occur $n$ times. More such test
   250   cases can be found at \url{http://www.computerbytesman.com/redos/}.}
   255   cases can be found at \url{http://www.computerbytesman.com/redos/}.}
   251 Simlarly, Python and Java needs approximately 30 seconds to find out
   256 Admittedly, these regular expressions are carefully chosen to exhibit
   252 that the regular expression $\texttt{(a*)*\,b}$ does not match strings
   257 this exponential behaviour, but similar ones occur more often than one
   253 of 28 \texttt{a}s.  Admittedly, these regular expressions are
   258 wants in ``real life''. For example, on 20 July 2016 a similar regular
   254 carefully chosen to exhibit this exponential behaviour, but similar
   259 expression brought the webpage \href{http://stackexchange.com}{Stack
   255 ones occur more often than one wants in ``real life''. For example, on
   260   Exchange} to its knees:
   256 20 July 2016 a similar regular expression brought the webpage
       
   257 \href{http://stackexchange.com}{Stack Exchange} to its knees:
       
   258 
   261 
   259 \begin{center}
   262 \begin{center}
   260 \url{http://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}
   263 \url{http://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}
   261 \end{center}
   264 \end{center}
   262 
   265 
   291 out why Python and Ruby (and others) behave so badly when
   294 out why Python and Ruby (and others) behave so badly when
   292 matching strings with evil regular expressions. But we will also look
   295 matching strings with evil regular expressions. But we will also look
   293 at a relatively simple algorithm that solves this problem much
   296 at a relatively simple algorithm that solves this problem much
   294 better than Python and Ruby do\ldots actually it will be two
   297 better than Python and Ruby do\ldots actually it will be two
   295 versions of the algorithm: the first one will be able to
   298 versions of the algorithm: the first one will be able to
   296 process strings of approximately 1,000 \texttt{a}s in 30
   299 process strings of approximately 1,100 \texttt{a}s in 23 
   297 seconds, while the second version will even be able to process
   300 seconds, while the second version will even be able to process
   298 up to 7,000(!) in 30 seconds, see the graph below:
   301 up to 11,000(!) in 5 seconds, see the graph below:
   299 
   302 
   300 \begin{center}
   303 \begin{center}
   301 \begin{tikzpicture}
   304 \begin{tikzpicture}
   302   \begin{axis}[
   305   \begin{axis}[
   303     title={Graph: $\texttt{a?\{n\}\,a{\{n\}}}$ and strings 
   306     title={Graph: $\texttt{a?\{n\}\,a{\{n\}}}$ and strings 
   304            $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
   307            $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
   305     xlabel={$n$},
   308     xlabel={$n$},
   306     x label style={at={(1.05,0.0)}},
   309     x label style={at={(1.05,0.0)}},
   307     ylabel={time in secs},
   310     ylabel={time in secs},
   308     enlargelimits=false,
   311     enlargelimits=false,
   309     xtick={0,3000,...,9000},
   312     xtick={0,3000,...,12000},
   310     xmax=10000,
   313     xmax=13000,
   311     ymax=32,
   314     ymax=32,
   312     ytick={0,5,...,30},
   315     ytick={0,5,...,30},
   313     scaled ticks=false,
   316     scaled ticks=false,
   314     axis lines=left,
   317     axis lines=left,
   315     width=7cm,
   318     width=7cm,
   694   best solution is to use regular expressions; now you have two
   697   best solution is to use regular expressions; now you have two
   695   problems.''
   698   problems.''
   696 \end{quote}  
   699 \end{quote}  
   697 
   700 
   698 
   701 
   699 \begin{figure}[p]
   702 \begin{figure}[p]\small
   700 \lstinputlisting{../progs/crawler1.scala}
   703   \lstinputlisting[linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}]
       
   704                   {../progs/crawler1.scala}
   701 
   705 
   702 \caption{The Scala code for a simple web-crawler that checks
   706 \caption{The Scala code for a simple web-crawler that checks
   703 for broken links in a web-page. It uses the regular expression
   707 for broken links in a web-page. It uses the regular expression
   704 \texttt{http\_pattern} in Line~\ref{httpline} for recognising
   708 \texttt{http\_pattern} in Line~\ref{httpline} for recognising
   705 URL-addresses. It finds all links using the library function
   709 URL-addresses. It finds all links using the library function
   706 \texttt{findAllIn} in Line~\ref{findallline}.\label{crawler1}}
   710 \texttt{findAllIn} in Line~\ref{findallline}.\label{crawler1}}
   707 
   711 
   708 \end{figure}
   712 \end{figure}
   709 
   713 
   710 
   714 
   711 \begin{figure}[p]
   715 
   712 \lstinputlisting{../progs/crawler2.scala}
   716 \begin{figure}[p]\small
       
   717   \lstinputlisting[linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}]
       
   718                   {../progs/crawler2.scala}
   713 
   719 
   714 \caption{A version of the web-crawler that only follows links
   720 \caption{A version of the web-crawler that only follows links
   715 in ``my'' domain---since these are the ones I am interested in
   721 in ``my'' domain---since these are the ones I am interested in
   716 to fix. It uses the regular expression \texttt{my\_urls} in
   722 to fix. It uses the regular expression \texttt{my\_urls} in
   717 Line~\ref{myurlline} to check for my name in the links. The
   723 Line~\ref{myurlline} to check for my name in the links. The
   720 is a test whether URL is in ``my'' domain or
   726 is a test whether URL is in ``my'' domain or
   721 not.\label{crawler2}}
   727 not.\label{crawler2}}
   722 
   728 
   723 \end{figure}
   729 \end{figure}
   724 
   730 
   725 \begin{figure}[p]
   731 \begin{figure}[p]\small
   726 \lstinputlisting{../progs/crawler3.scala}
   732   \lstinputlisting[linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}]
       
   733                   {../progs/crawler3.scala}
   727 
   734 
   728 \caption{A small email harvester---whenever we download a
   735 \caption{A small email harvester---whenever we download a
   729 web-page, we also check whether it contains any email
   736 web-page, we also check whether it contains any email
   730 addresses. For this we use the regular expression
   737 addresses. For this we use the regular expression
   731 \texttt{email\_pattern} in Line~\ref{emailline}. The main
   738 \texttt{email\_pattern} in Line~\ref{emailline}. The main