afl-material: comparison handouts/ho01.tex

-:d922cc83b70c
+:b78664a24f5d
 \documentclass{article}
 \usepackage{../style}
 \usepackage{../langs}
 \usepackage{../graphics}
 \usepackage{../data}
+\usepackage{lstlinebgrd}
+\definecolor{capri}{rgb}{0.0, 0.75, 1.0}
 %%http://regexcrossword.com/challenges/cities/puzzles/1
 %%https://jex.im/regulex/
 %%https://www.reddit.com/r/ProgrammingLanguages/comments/42dlem/mona_compiler_development_part_2_parsing/
 %%https://www.reddit.com/r/ProgrammingLanguages/comments/43wlkq/formal_grammar_for_csh_tsch_sh_or_bash/
 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017}
 \section*{Handout 1}
 This module is about text processing, be it for web-crawlers,
-compilers, dictionaries, DNA-data and so on. When looking for a
+compilers, dictionaries, DNA-data and so on.  When looking for a
 particular string, like $abc$ in a large text, we can use the
 Knuth-Morris-Pratt algorithm, which finds it in time linear in the
 length of the text. But often we do \emph{not} just look
-for a particular string, but for string patterns. For example in
+for a particular string, but for string patterns. For example, in
 program code we need to identify which words are keywords (if, then,
 while, etc.) and which are identifiers (variable names). A pattern for
 identifiers could be stated as: they start with a letter, followed by
-zero or more letters, numbers and underscores.  Also often we face the
+zero or more letters, numbers and underscores.  Often we also face the
 problem that we are given a string (for example some user input) and
 want to know whether it matches a particular pattern---be it an email
 address, for example. In this way we can exclude user input that would
 otherwise have nasty effects on our program (crashing it or making it
-go into an infinite loop, if not worse). Scanning for computer viruses
+go into an infinite loop, if not worse). In tools like Snort, scanning
-or filtering out spam usually involves scanning for some signature
+for computer viruses or filtering out spam usually involves scanning
-(essentially a pattern).  The point is that the fast
+for some signature (essentially a string pattern).  The point is that
-Knuth-Morris-Pratt algorithm for strings is not good enough for such
+the fast Knuth-Morris-Pratt algorithm for strings is not good enough
-string \emph{patterns}.\smallskip
+for such string \emph{patterns}.\smallskip
 \defn{Regular expressions} help with conveniently specifying
 such patterns. The idea behind regular expressions is that
 they are a simple method for describing languages (or sets of
 strings)\ldots at least languages we are interested in in
 interest in studying them again in depth in this module? Well, one
 answer is in the following two graphs about regular expression
 matching in Python, Ruby and Java.
 \begin{center}
-\begin{tabular}{@{\hspace{-1mm}}c@{\hspace{-1mm}}c@{}}
+\begin{tabular}{@{\hspace{-1mm}}c@{\hspace{1mm}}c@{}}
+\begin{tikzpicture}
+\begin{axis}[
+title={Graph: $\texttt{(a*)*\,b}$ and strings
+$\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
+xlabel={$n$},
+x label style={at={(1.05,0.0)}},
+ylabel={time in secs},
+enlargelimits=false,
+xtick={0,5,...,30},
+xmax=33,
+ymax=35,
+ytick={0,5,...,30},
+scaled ticks=false,
+axis lines=left,
+width=5.5cm,
+height=4.5cm,
+legend entries={Python, Java},
+legend pos=north west,
+legend cell align=left]
+\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
+\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
+\end{axis}
+\end{tikzpicture}
+&
 \begin{tikzpicture}
 \begin{axis}[
 title={Graph: $\texttt{a?\{n\}\,a{\{n\}}}$ and strings
 $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
 xlabel={$n$},
 legend cell align=left]
 \addplot[blue,mark=*, mark options={fill=white}] table {re-python.data};
 \addplot[brown,mark=triangle*, mark options={fill=white}] table {re-ruby.data};
 \end{axis}
 \end{tikzpicture}
-&
-\begin{tikzpicture}
-\begin{axis}[
-title={Graph: $\texttt{(a*)*\,b}$ and strings
-$\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
-xlabel={$n$},
-x label style={at={(1.05,0.0)}},
-ylabel={time in secs},
-enlargelimits=false,
-xtick={0,5,...,30},
-xmax=33,
-ymax=35,
-ytick={0,5,...,30},
-scaled ticks=false,
-axis lines=left,
-width=5.5cm,
-height=4.5cm,
-legend entries={Python, Java},
-legend pos=north west,
-legend cell align=left]
-\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
-\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
-\end{axis}
-\end{tikzpicture}
 \end{tabular}
 \end{center}
-\noindent This first graph shows that Python needs approximately 29
+\noindent This first graph shows that Python and Java need
-seconds for finding out whether a string of 28 \texttt{a}s matches the
+approximately 30 seconds to find out that the regular expression
-regular expression \texttt{a?\{28\}\,a\{28\}}.  Ruby is even slightly
+$\texttt{(a*)*\,b}$ does not match strings of 28 \texttt{a}s.
+Similarly, the second shows that Python needs approximately 29 seconds
+for finding out whether a string of 28 \texttt{a}s matches the regular
+expression \texttt{a?\{28\}\,a\{28\}}.  Ruby is even slightly
 worse.\footnote{In this example Ruby uses the slightly different
 regular expression \texttt{a?a?a?...a?a?aaa...aa}, where the
 \texttt{a?} and \texttt{a} each occur $n$ times. More such test
 cases can be found at \url{http://www.computerbytesman.com/redos/}.}
-Simlarly, Python and Java needs approximately 30 seconds to find out
+Admittedly, these regular expressions are carefully chosen to exhibit
-that the regular expression $\texttt{(a*)*\,b}$ does not match strings
+this exponential behaviour, but similar ones occur more often than one
-of 28 \texttt{a}s.  Admittedly, these regular expressions are
+wants in ``real life''. For example, on 20 July 2016 a similar regular
-carefully chosen to exhibit this exponential behaviour, but similar
+expression brought the webpage \href{http://stackexchange.com}{Stack
-ones occur more often than one wants in ``real life''. For example, on
+Exchange} to its knees:
-20 July 2016 a similar regular expression brought the webpage
-\href{http://stackexchange.com}{Stack Exchange} to its knees:
 \begin{center}
 \url{http://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}
 \end{center}
 out why Python and Ruby (and others) behave so badly when
 matching strings with evil regular expressions. But we will also look
 at a relatively simple algorithm that solves this problem much
 better than Python and Ruby do\ldots actually it will be two
 versions of the algorithm: the first one will be able to
-process strings of approximately 1,000 \texttt{a}s in 30
+process strings of approximately 1,100 \texttt{a}s in 23
 seconds, while the second version will even be able to process
-up to 7,000(!) in 30 seconds, see the graph below:
+up to 11,000(!) in 5 seconds (see the graph below):
 \begin{center}
 \begin{tikzpicture}
 \begin{axis}[
 title={Graph: $\texttt{a?\{n\}\,a{\{n\}}}$ and strings
 $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
 xlabel={$n$},
 x label style={at={(1.05,0.0)}},
 ylabel={time in secs},
 enlargelimits=false,
-xtick={0,3000,...,9000},
+xtick={0,3000,...,12000},
-xmax=10000,
+xmax=13000,
 ymax=32,
 ytick={0,5,...,30},
 scaled ticks=false,
 axis lines=left,
 width=7cm,
 best solution is to use regular expressions; now you have two
 problems.''
 \end{quote}
-\begin{figure}[p]
+\begin{figure}[p]\small
-\lstinputlisting{../progs/crawler1.scala}
+\lstinputlisting[linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}]
+{../progs/crawler1.scala}
 \caption{The Scala code for a simple web-crawler that checks
 for broken links in a web-page. It uses the regular expression
 \texttt{http\_pattern} in Line~\ref{httpline} for recognising
 URL-addresses. It finds all links using the library function
 \texttt{findAllIn} in Line~\ref{findallline}.\label{crawler1}}
 \end{figure}
-\begin{figure}[p]
-\lstinputlisting{../progs/crawler2.scala}
+\begin{figure}[p]\small
+\lstinputlisting[linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}]
+{../progs/crawler2.scala}
 \caption{A version of the web-crawler that only follows links
 in ``my'' domain---since these are the ones I am interested in
 fixing. It uses the regular expression \texttt{my\_urls} in
 Line~\ref{myurlline} to check for my name in the links. The
 is a test whether a URL is in ``my'' domain or
 not.\label{crawler2}}
 \end{figure}
-\begin{figure}[p]
+\begin{figure}[p]\small
-\lstinputlisting{../progs/crawler3.scala}
+\lstinputlisting[linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}]
+{../progs/crawler3.scala}
 \caption{A small email harvester---whenever we download a
 web-page, we also check whether it contains any email
 addresses. For this we use the regular expression
 \texttt{email\_pattern} in Line~\ref{emailline}. The main

changeset 477	b78664a24f5d
parent 473	dc528091eb70
child 492	39b7ff2cf1bc