handouts/ho01.tex
changeset 477 b78664a24f5d
parent 473 dc528091eb70
child 492 39b7ff2cf1bc
--- a/handouts/ho01.tex	Mon Feb 13 23:22:45 2017 +0000
+++ b/handouts/ho01.tex	Wed Mar 15 01:24:39 2017 +0000
@@ -3,6 +3,8 @@
 \usepackage{../langs}
 \usepackage{../graphics}
 \usepackage{../data}
+\usepackage{lstlinebgrd}
+\definecolor{capri}{rgb}{0.0, 0.75, 1.0}
 
 %%http://regexcrossword.com/challenges/cities/puzzles/1
 %%https://jex.im/regulex/
@@ -37,24 +39,24 @@
 \section*{Handout 1}
 
 This module is about text processing, be it for web-crawlers,
-compilers, dictionaries, DNA-data and so on. When looking for a
+compilers, dictionaries, DNA-data and so on.  When looking for a
 particular string, like $abc$, in a large text we can use the
 Knuth-Morris-Pratt algorithm, which is currently the most efficient
 general string search algorithm. But often we do \emph{not} just look
-for a particular string, but for string patterns. For example in
+for a particular string, but for string patterns. For example, in
 program code we need to identify what are the keywords (if, then,
 while, etc), what are the identifiers (variable names). A pattern for
 identifiers could be stated as: they start with a letter, followed by
-zero or more letters, numbers and underscores.  Also often we face the
+zero or more letters, numbers and underscores.  Often we also face the
 problem that we are given a string (for example some user input) and
 want to know whether it matches a particular pattern---be it an email
 address, for example. In this way we can exclude user input that would
 otherwise have nasty effects on our program (crashing it or making it
-go into an infinite loop, if not worse). Scanning for computer viruses
-or filtering out spam usually involves scanning for some signature
-(essentially a pattern).  The point is that the fast
-Knuth-Morris-Pratt algorithm for strings is not good enough for such
-string \emph{patterns}.\smallskip
+go into an infinite loop, if not worse). In tools like Snort, scanning
+for computer viruses or filtering out spam usually involves searching
+for some signature (essentially a string pattern).  The point is that
+the fast Knuth-Morris-Pratt algorithm for strings is not good enough
+for such string \emph{patterns}.\smallskip
 
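The identifier pattern described above (a letter, followed by zero or more letters, numbers and underscores) can be written down concretely. A minimal sketch in Python; the exact character classes are my own choice for illustration, not fixed by the handout:

```python
import re

# Hypothetical identifier pattern following the description in the text:
# one letter, then zero or more letters, digits or underscores.
# \Z anchors the match at the end of the string.
ident = re.compile(r"[a-zA-Z][a-zA-Z0-9_]*\Z")

print(bool(ident.match("x_1")))   # True: valid identifier
print(bool(ident.match("1x")))    # False: starts with a digit
```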
 \defn{Regular expressions} help with conveniently specifying
 such patterns. The idea behind regular expressions is that
@@ -190,7 +192,31 @@
 matching in Python, Ruby and Java.
 
 \begin{center}
-\begin{tabular}{@{\hspace{-1mm}}c@{\hspace{-1mm}}c@{}}
+\begin{tabular}{@{\hspace{-1mm}}c@{\hspace{1mm}}c@{}}
+\begin{tikzpicture}
+\begin{axis}[
+    title={Graph: $\texttt{(a*)*\,b}$ and strings 
+           $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
+    xlabel={$n$},
+    x label style={at={(1.05,0.0)}},
+    ylabel={time in secs},
+    enlargelimits=false,
+    xtick={0,5,...,30},
+    xmax=33,
+    ymax=35,
+    ytick={0,5,...,30},
+    scaled ticks=false,
+    axis lines=left,
+    width=5.5cm,
+    height=4.5cm, 
+    legend entries={Python, Java},  
+    legend pos=north west,
+    legend cell align=left]
+\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
+\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
+\end{axis}
+\end{tikzpicture}
+&
 \begin{tikzpicture}
 \begin{axis}[
     title={Graph: $\texttt{a?\{n\}\,a{\{n\}}}$ and strings 
@@ -214,47 +240,24 @@
 \addplot[brown,mark=triangle*, mark options={fill=white}] table {re-ruby.data};  
 \end{axis}
 \end{tikzpicture}
-&
-\begin{tikzpicture}
-\begin{axis}[
-    title={Graph: $\texttt{(a*)*\,b}$ and strings 
-           $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
-    xlabel={$n$},
-    x label style={at={(1.05,0.0)}},
-    ylabel={time in secs},
-    enlargelimits=false,
-    xtick={0,5,...,30},
-    xmax=33,
-    ymax=35,
-    ytick={0,5,...,30},
-    scaled ticks=false,
-    axis lines=left,
-    width=5.5cm,
-    height=4.5cm, 
-    legend entries={Python, Java},  
-    legend pos=north west,
-    legend cell align=left]
-\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
-\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
-\end{axis}
-\end{tikzpicture}
 \end{tabular}
 \end{center}
 
-\noindent This first graph shows that Python needs approximately 29
-seconds for finding out whether a string of 28 \texttt{a}s matches the
-regular expression \texttt{a?\{28\}\,a\{28\}}.  Ruby is even slightly
+\noindent This first graph shows that Python and Java need
+approximately 30 seconds to find out that the regular expression
+$\texttt{(a*)*\,b}$ does not match strings of 28 \texttt{a}s.
+Similarly, the second graph shows that Python needs approximately 29 seconds
+for finding out whether a string of 28 \texttt{a}s matches the regular
+expression \texttt{a?\{28\}\,a\{28\}}.  Ruby is even slightly
 worse.\footnote{In this example Ruby uses the slightly different
   regular expression \texttt{a?a?a?...a?a?aaa...aa}, where the
   \texttt{a?} and \texttt{a} each occur $n$ times. More such test
   cases can be found at \url{http://www.computerbytesman.com/redos/}.}
-Simlarly, Python and Java needs approximately 30 seconds to find out
-that the regular expression $\texttt{(a*)*\,b}$ does not match strings
-of 28 \texttt{a}s.  Admittedly, these regular expressions are
-carefully chosen to exhibit this exponential behaviour, but similar
-ones occur more often than one wants in ``real life''. For example, on
-20 July 2016 a similar regular expression brought the webpage
-\href{http://stackexchange.com}{Stack Exchange} to its knees:
+Admittedly, these regular expressions are carefully chosen to exhibit
+this exponential behaviour, but similar ones occur more often than one
+wants in ``real life''. For example, on 20 July 2016 a similar regular
+expression brought the webpage \href{http://stackexchange.com}{Stack
+  Exchange} to its knees:
 
 \begin{center}
 \url{http://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}
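The exponential behaviour shown in the two graphs can be reproduced directly. A small sketch in Python, with $n$ kept well below the 28 used in the graphs so that it finishes quickly (each increment of $n$ roughly doubles the running time):

```python
import re
import time

n = 22  # kept small; at n = 28 each match already takes tens of seconds

# Evil regex 1: (a*)*b against a string of n a's.  It can never match,
# but a backtracking matcher explores exponentially many ways of
# splitting the a's between the two stars before giving up.
start = time.time()
assert re.match(r"(a*)*b", "a" * n) is None
print(f"(a*)*b with n={n}: {time.time() - start:.2f}s")

# Evil regex 2: a? repeated n times, then a repeated n times, against
# a string of n a's (the pattern from the second graph, spelled out).
pat = "a?" * n + "a" * n
start = time.time()
assert re.match(pat, "a" * n) is not None
print(f"a?{{n}} a{{n}} with n={n}: {time.time() - start:.2f}s")
```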
@@ -293,9 +296,9 @@
 at a relatively simple algorithm that solves this problem much
 better than Python and Ruby do\ldots actually it will be two
 versions of the algorithm: the first one will be able to
-process strings of approximately 1,000 \texttt{a}s in 30
+process strings of approximately 1,100 \texttt{a}s in 23 
 seconds, while the second version will even be able to process
-up to 7,000(!) in 30 seconds, see the graph below:
+up to 11,000(!) in 5 seconds, see the graph below:
 
 \begin{center}
 \begin{tikzpicture}
@@ -306,8 +309,8 @@
     x label style={at={(1.05,0.0)}},
     ylabel={time in secs},
     enlargelimits=false,
-    xtick={0,3000,...,9000},
-    xmax=10000,
+    xtick={0,3000,...,12000},
+    xmax=13000,
     ymax=32,
     ytick={0,5,...,30},
     scaled ticks=false,
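The two fast versions are developed later in the handout. As a rough illustration of why a non-backtracking approach scales, here is a hypothetical sketch of matching via Brzozowski derivatives, which I assume is the technique behind the graph; the tuple encoding and names below are my own, not the handout's, and this sketch omits the simplification step that the fast versions need to keep derivatives small:

```python
# Regular expressions encoded as tagged tuples (a hypothetical encoding).
ZERO, ONE = ("0",), ("1",)
def CHAR(c):   return ("c", c)
def ALT(r, s): return ("+", r, s)
def SEQ(r, s): return (".", r, s)
def STAR(r):   return ("*", r)

def nullable(r):
    """Can r match the empty string?"""
    tag = r[0]
    if tag == "0": return False
    if tag == "1": return True
    if tag == "c": return False
    if tag == "+": return nullable(r[1]) or nullable(r[2])
    if tag == ".": return nullable(r[1]) and nullable(r[2])
    if tag == "*": return True

def der(c, r):
    """Brzozowski derivative of r with respect to the character c."""
    tag = r[0]
    if tag in ("0", "1"): return ZERO
    if tag == "c": return ONE if r[1] == c else ZERO
    if tag == "+": return ALT(der(c, r[1]), der(c, r[2]))
    if tag == ".":
        d = SEQ(der(c, r[1]), r[2])
        return ALT(d, der(c, r[2])) if nullable(r[1]) else d
    if tag == "*": return SEQ(der(c, r[1]), r)

def matches(r, s):
    # One derivative per character, then test for the empty string:
    # no backtracking, ever.
    for c in s:
        r = der(c, r)
    return nullable(r)

# The evil regex (a*)*b from the graphs above.
evil = SEQ(STAR(STAR(CHAR("a"))), CHAR("b"))
print(matches(evil, "aaaaab"))   # True
print(matches(evil, "aaaaa"))    # False, decided without backtracking
```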
@@ -696,8 +699,9 @@
 \end{quote}  
 
 
-\begin{figure}[p]
-\lstinputlisting{../progs/crawler1.scala}
+\begin{figure}[p]\small
+  \lstinputlisting[linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}]
+                  {../progs/crawler1.scala}
 
 \caption{The Scala code for a simple web-crawler that checks
 for broken links in a web-page. It uses the regular expression
@@ -708,8 +712,10 @@
 \end{figure}
 
 
-\begin{figure}[p]
-\lstinputlisting{../progs/crawler2.scala}
+
+\begin{figure}[p]\small
+  \lstinputlisting[linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}]
+                  {../progs/crawler2.scala}
 
 \caption{A version of the web-crawler that only follows links
 in ``my'' domain---since these are the ones I am interested in
@@ -722,8 +728,9 @@
 
 \end{figure}
 
-\begin{figure}[p]
-\lstinputlisting{../progs/crawler3.scala}
+\begin{figure}[p]\small
+  \lstinputlisting[linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}]
+                  {../progs/crawler3.scala}
 
 \caption{A small email harvester---whenever we download a
 web-page, we also check whether it contains any email