--- a/handouts/ho01.tex Mon Feb 13 23:22:45 2017 +0000
+++ b/handouts/ho01.tex Wed Mar 15 01:24:39 2017 +0000
@@ -3,6 +3,8 @@
\usepackage{../langs}
\usepackage{../graphics}
\usepackage{../data}
+\usepackage{lstlinebgrd}
+\definecolor{capri}{rgb}{0.0, 0.75, 1.0}
%%http://regexcrossword.com/challenges/cities/puzzles/1
%%https://jex.im/regulex/
@@ -37,24 +39,24 @@
\section*{Handout 1}
This module is about text processing, be it for web-crawlers,
-compilers, dictionaries, DNA-data and so on. When looking for a
+compilers, dictionaries, DNA-data and so on. When looking for a
particular string, like $abc$ in a large text we can use the
Knuth-Morris-Pratt algorithm, which is currently the most efficient
general string search algorithm. But often we do \emph{not} just look
-for a particular string, but for string patterns. For example in
+for a particular string, but for string patterns. For example, in
program code we need to identify what are the keywords (if, then,
while, etc), what are the identifiers (variable names). A pattern for
identifiers could be stated as: they start with a letter, followed by
-zero or more letters, numbers and underscores. Also often we face the
+zero or more letters, numbers and underscores. Often we also face the
problem that we are given a string (for example some user input) and
want to know whether it matches a particular pattern---be it an email
address, for example. In this way we can exclude user input that would
otherwise have nasty effects on our program (crashing it or making it
-go into an infinite loop, if not worse). Scanning for computer viruses
-or filtering out spam usually involves scanning for some signature
-(essentially a pattern). The point is that the fast
-Knuth-Morris-Pratt algorithm for strings is not good enough for such
-string \emph{patterns}.\smallskip
+go into an infinite loop, if not worse). In tools like Snort, scanning
+for computer viruses or filtering out spam usually involves scanning
+for some signature (essentially a string pattern). The point is that
+the fast Knuth-Morris-Pratt algorithm for strings is not good enough
+for such string \emph{patterns}.\smallskip
\defn{Regular expressions} help with conveniently specifying
such patterns. The idea behind regular expressions is that
@@ -190,7 +192,31 @@
matching in Python, Ruby and Java.
\begin{center}
-\begin{tabular}{@{\hspace{-1mm}}c@{\hspace{-1mm}}c@{}}
+\begin{tabular}{@{\hspace{-1mm}}c@{\hspace{1mm}}c@{}}
+\begin{tikzpicture}
+\begin{axis}[
+ title={Graph: $\texttt{(a*)*\,b}$ and strings
+ $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
+ xlabel={$n$},
+ x label style={at={(1.05,0.0)}},
+ ylabel={time in secs},
+ enlargelimits=false,
+ xtick={0,5,...,30},
+ xmax=33,
+ ymax=35,
+ ytick={0,5,...,30},
+ scaled ticks=false,
+ axis lines=left,
+ width=5.5cm,
+ height=4.5cm,
+ legend entries={Python, Java},
+ legend pos=north west,
+ legend cell align=left]
+\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
+\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
+\end{axis}
+\end{tikzpicture}
+&
\begin{tikzpicture}
\begin{axis}[
title={Graph: $\texttt{a?\{n\}\,a{\{n\}}}$ and strings
@@ -214,47 +240,24 @@
\addplot[brown,mark=triangle*, mark options={fill=white}] table {re-ruby.data};
\end{axis}
\end{tikzpicture}
-&
-\begin{tikzpicture}
-\begin{axis}[
- title={Graph: $\texttt{(a*)*\,b}$ and strings
- $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
- xlabel={$n$},
- x label style={at={(1.05,0.0)}},
- ylabel={time in secs},
- enlargelimits=false,
- xtick={0,5,...,30},
- xmax=33,
- ymax=35,
- ytick={0,5,...,30},
- scaled ticks=false,
- axis lines=left,
- width=5.5cm,
- height=4.5cm,
- legend entries={Python, Java},
- legend pos=north west,
- legend cell align=left]
-\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
-\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
-\end{axis}
-\end{tikzpicture}
\end{tabular}
\end{center}
-\noindent This first graph shows that Python needs approximately 29
-seconds for finding out whether a string of 28 \texttt{a}s matches the
-regular expression \texttt{a?\{28\}\,a\{28\}}. Ruby is even slightly
+\noindent This first graph shows that Python and Java need
+approximately 30 seconds to find out that the regular expression
+$\texttt{(a*)*\,b}$ does not match strings of 28 \texttt{a}s.
+Similarly, the second shows that Python needs approximately 29 seconds
+for finding out whether a string of 28 \texttt{a}s matches the regular
+expression \texttt{a?\{28\}\,a\{28\}}. Ruby is even slightly
worse.\footnote{In this example Ruby uses the slightly different
regular expression \texttt{a?a?a?...a?a?aaa...aa}, where the
\texttt{a?} and \texttt{a} each occur $n$ times. More such test
cases can be found at \url{http://www.computerbytesman.com/redos/}.}
-Simlarly, Python and Java needs approximately 30 seconds to find out
-that the regular expression $\texttt{(a*)*\,b}$ does not match strings
-of 28 \texttt{a}s. Admittedly, these regular expressions are
-carefully chosen to exhibit this exponential behaviour, but similar
-ones occur more often than one wants in ``real life''. For example, on
-20 July 2016 a similar regular expression brought the webpage
-\href{http://stackexchange.com}{Stack Exchange} to its knees:
+Admittedly, these regular expressions are carefully chosen to exhibit
+this exponential behaviour, but similar ones occur more often than one
+wants in ``real life''. For example, on 20 July 2016 a similar regular
+expression brought the webpage \href{http://stackexchange.com}{Stack
+ Exchange} to its knees:
\begin{center}
\url{http://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}
@@ -293,9 +296,9 @@
at a relatively simple algorithm that solves this problem much
better than Python and Ruby do\ldots actually it will be two
versions of the algorithm: the first one will be able to
-process strings of approximately 1,000 \texttt{a}s in 30
+process strings of approximately 1,100 \texttt{a}s in 23
seconds, while the second version will even be able to process
-up to 7,000(!) in 30 seconds, see the graph below:
+up to 11,000(!) in 5 seconds, see the graph below:
\begin{center}
\begin{tikzpicture}
@@ -306,8 +309,8 @@
x label style={at={(1.05,0.0)}},
ylabel={time in secs},
enlargelimits=false,
- xtick={0,3000,...,9000},
- xmax=10000,
+ xtick={0,3000,...,12000},
+ xmax=13000,
ymax=32,
ytick={0,5,...,30},
scaled ticks=false,
@@ -696,8 +699,9 @@
\end{quote}
-\begin{figure}[p]
-\lstinputlisting{../progs/crawler1.scala}
+\begin{figure}[p]\small
+ \lstinputlisting[linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}]
+ {../progs/crawler1.scala}
\caption{The Scala code for a simple web-crawler that checks
for broken links in a web-page. It uses the regular expression
@@ -708,8 +712,10 @@
\end{figure}
-\begin{figure}[p]
-\lstinputlisting{../progs/crawler2.scala}
+
+\begin{figure}[p]\small
+ \lstinputlisting[linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}]
+ {../progs/crawler2.scala}
\caption{A version of the web-crawler that only follows links
in ``my'' domain---since these are the ones I am interested in
@@ -722,8 +728,9 @@
\end{figure}
-\begin{figure}[p]
-\lstinputlisting{../progs/crawler3.scala}
+\begin{figure}[p]\small
+ \lstinputlisting[linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}]
+ {../progs/crawler3.scala}
\caption{A small email harvester---whenever we download a
web-page, we also check whether it contains any email