handouts/ho01.tex
changeset 504 5dc452d7c08e
parent 503 f2d7b885b3e3
parent 496 5c9de27a5b30
child 507 fdbc7d0ec04f
--- a/handouts/ho01.tex	Tue Sep 26 14:07:29 2017 +0100
+++ b/handouts/ho01.tex	Tue Sep 26 14:08:49 2017 +0100
@@ -3,6 +3,8 @@
 \usepackage{../langs}
 \usepackage{../graphics}
 \usepackage{../data}
+\usepackage{lstlinebgrd}
+\definecolor{capri}{rgb}{0.0, 0.75, 1.0}
 
 %%http://regexcrossword.com/challenges/cities/puzzles/1
 %%https://jex.im/regulex/
@@ -16,6 +18,8 @@
 
 %% email validator
 %% http://www.ex-parrot.com/%7Epdw/Mail-RFC822-Address.html
+% https://jackfoxy.github.io/FsRegEx/emailregex.html
+
 
 %% regex testers
 % https://regex101.com
@@ -37,22 +41,24 @@
 \section*{Handout 1}
 
 This module is about text processing, be it for web-crawlers,
-compilers, dictionaries, DNA-data and so on. When looking for a
+compilers, dictionaries, DNA-data, ad filters and so on.  When looking for a
 particular string, like $abc$ in a large text we can use the
 Knuth-Morris-Pratt algorithm, which is currently the most efficient
 general string search algorithm. But often we do \emph{not} just look
-for a particular string, but for string patterns. For example in
+for a particular string, but for string patterns. For example, in
 program code we need to identify what the keywords are (if, then,
-while, etc), what are the identifiers (variable names). A pattern for
+while, for, etc) and what the identifiers are (variable names). A pattern for
 identifiers could be stated as: they start with a letter, followed by
-zero or more letters, numbers and underscores.  Also often we face the
+zero or more letters, numbers and underscores.  Often we also face the
 problem that we are given a string (for example some user input) and
 want to know whether it matches a particular pattern---be it an email
 address, for example. In this way we can exclude user input that would
 otherwise have nasty effects on our program (crashing it or making it
-go into an infinite loop, if not worse). The point is that the fast
-Knuth-Morris-Pratt algorithm for strings is not good enough for such
-string patterns.\smallskip
+go into an infinite loop, if not worse). In tools like Snort, detecting
+computer viruses or filtering out spam usually involves scanning
+for some signature (essentially a string pattern).  The point is that
+the fast Knuth-Morris-Pratt algorithm for strings is not good enough
+for such string \emph{patterns}.\smallskip
 
 \defn{Regular expressions} help with conveniently specifying
 such patterns. The idea behind regular expressions is that
@@ -188,7 +194,31 @@
 matching in Python, Ruby and Java.
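The effect shown in the graphs is easy to reproduce yourself. Below is a
minimal sketch using Python's \texttt{re} module (this is not the script
used to produce the measurements, and absolute times will of course vary
from machine to machine):

```python
# A quick experiment: Python's backtracking regex engine needs roughly
# 2^n steps to *reject* the "evil" pattern (a*)*b on a string of n a's,
# because it retries every way of splitting the a's between the two
# stars before giving up.
import re
import time

def reject_time(n):
    """Time how long the re module needs to reject (a*)*b on n a's."""
    start = time.perf_counter()
    assert re.match(r'(a*)*b', 'a' * n) is None   # no 'b', so no match
    return time.perf_counter() - start

for n in range(16, 23, 2):
    print(n, round(reject_time(n), 3))   # times grow roughly exponentially
```

Increasing \texttt{n} towards 28 quickly pushes the running time into the
region shown in the graphs.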
 
 \begin{center}
-\begin{tabular}{@{\hspace{-1mm}}c@{\hspace{-1mm}}c@{}}
+\begin{tabular}{@{\hspace{-1mm}}c@{\hspace{1mm}}c@{}}
+\begin{tikzpicture}
+\begin{axis}[
+    title={Graph: $\texttt{(a*)*\,b}$ and strings 
+           $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
+    xlabel={$n$},
+    x label style={at={(1.05,0.0)}},
+    ylabel={time in secs},
+    enlargelimits=false,
+    xtick={0,5,...,30},
+    xmax=33,
+    ymax=35,
+    ytick={0,5,...,30},
+    scaled ticks=false,
+    axis lines=left,
+    width=5.5cm,
+    height=4.5cm, 
+    legend entries={Python, Java},  
+    legend pos=north west,
+    legend cell align=left]
+\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
+\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
+\end{axis}
+\end{tikzpicture}
+&
 \begin{tikzpicture}
 \begin{axis}[
     title={Graph: $\texttt{a?\{n\}\,a{\{n\}}}$ and strings 
@@ -212,47 +242,24 @@
 \addplot[brown,mark=triangle*, mark options={fill=white}] table {re-ruby.data};  
 \end{axis}
 \end{tikzpicture}
-&
-\begin{tikzpicture}
-\begin{axis}[
-    title={Graph: $\texttt{(a*)*\,b}$ and strings 
-           $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
-    xlabel={$n$},
-    x label style={at={(1.05,0.0)}},
-    ylabel={time in secs},
-    enlargelimits=false,
-    xtick={0,5,...,30},
-    xmax=33,
-    ymax=35,
-    ytick={0,5,...,30},
-    scaled ticks=false,
-    axis lines=left,
-    width=5.5cm,
-    height=4.5cm, 
-    legend entries={Python, Java},  
-    legend pos=north west,
-    legend cell align=left]
-\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
-\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
-\end{axis}
-\end{tikzpicture}
 \end{tabular}
 \end{center}
 
-\noindent This first graph shows that Python needs approximately 29
-seconds for finding out whether a string of 28 \texttt{a}s matches the
-regular expression \texttt{a?\{28\}\,a\{28\}}.  Ruby is even slightly
+\noindent This first graph shows that Python and Java need
+approximately 30 seconds to find out that the regular expression
+$\texttt{(a*)*\,b}$ does not match strings of 28 \texttt{a}s.
+Similarly, the second graph shows that Python needs approximately 29
+seconds to find out whether a string of 28 \texttt{a}s matches the regular
+expression \texttt{a?\{28\}\,a\{28\}}.  Ruby is even slightly
 worse.\footnote{In this example Ruby uses the slightly different
   regular expression \texttt{a?a?a?...a?a?aaa...aa}, where the
   \texttt{a?} and \texttt{a} each occur $n$ times. More such test
   cases can be found at \url{http://www.computerbytesman.com/redos/}.}
-Simlarly, Python and Java needs approximately 30 seconds to find out
-that the regular expression $\texttt{(a*)*\,b}$ does not match strings
-of 28 \texttt{a}s.  Admittedly, these regular expressions are
-carefully chosen to exhibit this exponential behaviour, but similar
-ones occur more often than one wants in ``real life''. For example, on
-20 July 2016 a similar regular expression brought the webpage
-\href{http://stackexchange.com}{Stack Exchange} to its knees:
+Admittedly, these regular expressions are carefully chosen to exhibit
+this exponential behaviour, but similar ones occur more often than one
+wants in ``real life''. For example, on 20 July 2016 a similar regular
+expression brought the webpage \href{http://stackexchange.com}{Stack
+  Exchange} to its knees:
 
 \begin{center}
 \url{http://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}
@@ -291,9 +298,9 @@
 at a relatively simple algorithm that solves this problem much
 better than Python and Ruby do\ldots actually it will be two
 versions of the algorithm: the first one will be able to
-process strings of approximately 1,000 \texttt{a}s in 30
+process strings of approximately 1,100 \texttt{a}s in 23 
 seconds, while the second version will even be able to process
-up to 7,000(!) in 30 seconds, see the graph below:
+up to 11,000(!) in 5 seconds, see the graph below:
 
 \begin{center}
 \begin{tikzpicture}
@@ -304,8 +311,8 @@
     x label style={at={(1.05,0.0)}},
     ylabel={time in secs},
     enlargelimits=false,
-    xtick={0,3000,...,9000},
-    xmax=10000,
+    xtick={0,3000,...,12000},
+    xmax=13000,
     ymax=32,
     ytick={0,5,...,30},
     scaled ticks=false,
@@ -531,7 +538,7 @@
 precisely specify when a string $s$ is matched by a regular
 expression $r$, namely if and only if $s \in L(r)$. In fact we
 will write a program \pcode{match} that takes any string $s$
-and any regular expression $r$ as argument and returns
+and any regular expression $r$ as arguments and returns
 \emph{yes}, if $s \in L(r)$ and \emph{no}, if $s \not\in
 L(r)$. We leave this for the next lecture.
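To give a first taste of what is coming: one classic way to implement such
a \pcode{match} function is via Brzozowski derivatives. Below is a minimal
sketch in Python (the lectures will develop their own version; the class
names here are invented for this illustration, and no simplification of
the intermediate regular expressions is performed):

```python
# A sketch of regular expression matching via Brzozowski derivatives:
# s is in L(r) iff deriving r by every character of s in turn yields a
# regular expression that can match the empty string.
from dataclasses import dataclass

class Rexp:
    pass

@dataclass(frozen=True)
class Zero(Rexp): pass                 # matches no string at all
@dataclass(frozen=True)
class One(Rexp): pass                  # matches only the empty string
@dataclass(frozen=True)
class Char(Rexp): c: str               # matches the single character c
@dataclass(frozen=True)
class Alt(Rexp): r1: Rexp; r2: Rexp    # alternative  r1 + r2
@dataclass(frozen=True)
class Seq(Rexp): r1: Rexp; r2: Rexp    # sequence     r1 . r2
@dataclass(frozen=True)
class Star(Rexp): r: Rexp              # repetition   r*

def nullable(r):
    """Does r match the empty string?"""
    if isinstance(r, (One, Star)):
        return True
    if isinstance(r, Alt):
        return nullable(r.r1) or nullable(r.r2)
    if isinstance(r, Seq):
        return nullable(r.r1) and nullable(r.r2)
    return False                       # Zero and Char

def der(c, r):
    """The derivative of r with respect to the character c."""
    if isinstance(r, Char):
        return One() if r.c == c else Zero()
    if isinstance(r, Alt):
        return Alt(der(c, r.r1), der(c, r.r2))
    if isinstance(r, Seq):
        left = Seq(der(c, r.r1), r.r2)
        return Alt(left, der(c, r.r2)) if nullable(r.r1) else left
    if isinstance(r, Star):
        return Seq(der(c, r.r), Star(r.r))
    return Zero()                      # Zero and One

def matches(r, s):
    """Yes iff s is in L(r)."""
    for c in s:
        r = der(c, r)
    return nullable(r)

# the "evil" regular expression (a*)*b from the graphs above
evil = Seq(Star(Star(Char('a'))), Char('b'))
print(matches(evil, 'aaab'))   # True
print(matches(evil, 'aaaa'))   # False
```

Without any simplification the intermediate derivatives grow quickly, so
this sketch is slow on long inputs; keeping the derivatives small is
essentially what distinguishes the two versions of the algorithm
mentioned above.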
 
@@ -636,9 +643,9 @@
 Python, Ruby and Java in some instances and the problems in Stack
 Exchange and the Atom editor).  People who are not very familiar with
 the mathematical background of regular expressions get them
-consistently wrong (surprising given they are a supposed to be core
-skill for computer scientists). The hope is that we can do better in
-the future---for example by proving that the algorithms actually
+consistently wrong (this is surprising given they are supposed to be a
+core skill for computer scientists). The hope is that we can do better
+in the future---for example by proving that the algorithms actually
 satisfy their specification and that the corresponding implementations
 do not contain any bugs. We are close, but not yet quite there.
 
@@ -647,15 +654,15 @@
 ``theoretical'' shortcomings, for example recognising strings of the
 form $a^{n}b^{n}$ is not possible with regular expressions. This means
 for example that recognising whether parentheses are well-nested
-is impossible with (basic) regular expressions.  I am not so bothered
-by these shortcomings. What I am bothered about is when regular
-expressions are in the way of practical programming. For example, it
-turns out that the regular expression for email addresses shown in
-\eqref{email} is hopelessly inadequate for recognising all of them
-(despite being touted as something every computer scientist should
-know about). The W3 Consortium (which standardises the Web) proposes
-to use the following, already more complicated regular expressions for
-email addresses:
+in an expression is impossible with (basic) regular expressions.  I am
+not so bothered by these shortcomings. What I am bothered about is
+when regular expressions are in the way of practical programming. For
+example, it turns out that the regular expression for email addresses
+shown in \eqref{email} is hopelessly inadequate for recognising all of
+them (despite being touted as something every computer scientist
+should know about). The W3 Consortium (which standardises the Web)
+proposes to use the following, already more complicated regular
+expressions for email addresses:
 
 {\small\begin{lstlisting}[language={},keywordstyle=\color{black},numbers=none]
 [a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*
@@ -682,11 +689,21 @@
 addresses. Still, it is good to know that some tasks in text
 processing just cannot be achieved by using regular
 expressions. But for what we want to use them (lexing) they are
-pretty good.
+pretty good.\medskip
+
+\noindent
+Finally, there is a well-known joke about regular expressions:
+
+\begin{quote}\it
+  ``Sometimes you have a programming problem and it seems like the
+  best solution is to use regular expressions; now you have two
+  problems.''
+\end{quote}  
 
 
-\begin{figure}[p]
-\lstinputlisting{../progs/crawler1.scala}
+\begin{figure}[p]\small
+  \lstinputlisting[linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}]
+                  {../progs/crawler1.scala}
 
 \caption{The Scala code for a simple web-crawler that checks
 for broken links in a web-page. It uses the regular expression
@@ -697,8 +714,10 @@
 \end{figure}
 
 
-\begin{figure}[p]
-\lstinputlisting{../progs/crawler2.scala}
+
+\begin{figure}[p]\small
+  \lstinputlisting[linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}]
+                  {../progs/crawler2.scala}
 
 \caption{A version of the web-crawler that only follows links
 in ``my'' domain---since these are the ones I am interested in
@@ -711,8 +730,9 @@
 
 \end{figure}
 
-\begin{figure}[p]
-\lstinputlisting{../progs/crawler3.scala}
+\begin{figure}[p]\small
+  \lstinputlisting[linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}]
+                  {../progs/crawler3.scala}
 
 \caption{A small email harvester---whenever we download a
 web-page, we also check whether it contains any email