diff -r d922cc83b70c -r b78664a24f5d handouts/ho01.tex --- a/handouts/ho01.tex Mon Feb 13 23:22:45 2017 +0000 +++ b/handouts/ho01.tex Wed Mar 15 01:24:39 2017 +0000 @@ -3,6 +3,8 @@ \usepackage{../langs} \usepackage{../graphics} \usepackage{../data} +\usepackage{lstlinebgrd} +\definecolor{capri}{rgb}{0.0, 0.75, 1.0} %%http://regexcrossword.com/challenges/cities/puzzles/1 %%https://jex.im/regulex/ @@ -37,24 +39,24 @@ \section*{Handout 1} This module is about text processing, be it for web-crawlers, -compilers, dictionaries, DNA-data and so on. When looking for a +compilers, dictionaries, DNA-data and so on. When looking for a particular string, like $abc$ in a large text we can use the Knuth-Morris-Pratt algorithm, which is currently the most efficient general string search algorithm. But often we do \emph{not} just look -for a particular string, but for string patterns. For example in +for a particular string, but for string patterns. For example, in program code we need to identify what are the keywords (if, then, while, etc), what are the identifiers (variable names). A pattern for identifiers could be stated as: they start with a letter, followed by -zero or more letters, numbers and underscores. Also often we face the +zero or more letters, numbers and underscores. Often we also face the problem that we are given a string (for example some user input) and want to know whether it matches a particular pattern---be it an email address, for example. In this way we can exclude user input that would otherwise have nasty effects on our program (crashing it or making it -go into an infinite loop, if not worse). Scanning for computer viruses -or filtering out spam usually involves scanning for some signature -(essentially a pattern). The point is that the fast -Knuth-Morris-Pratt algorithm for strings is not good enough for such -string \emph{patterns}.\smallskip +go into an infinite loop, if not worse). In tools like Snort, scanning +for computer viruses or filtering out spam usually involves scanning +for some signature (essentially a string pattern). The point is that +the fast Knuth-Morris-Pratt algorithm for strings is not good enough +for such string \emph{patterns}.\smallskip \defn{Regular expressions} help with conveniently specifying such patterns. The idea behind regular expressions is that @@ -190,7 +192,31 @@ matching in Python, Ruby and Java. \begin{center} -\begin{tabular}{@{\hspace{-1mm}}c@{\hspace{-1mm}}c@{}} +\begin{tabular}{@{\hspace{-1mm}}c@{\hspace{1mm}}c@{}} +\begin{tikzpicture} +\begin{axis}[ + title={Graph: $\texttt{(a*)*\,b}$ and strings + $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$}, + xlabel={$n$}, + x label style={at={(1.05,0.0)}}, + ylabel={time in secs}, + enlargelimits=false, + xtick={0,5,...,30}, + xmax=33, + ymax=35, + ytick={0,5,...,30}, + scaled ticks=false, + axis lines=left, + width=5.5cm, + height=4.5cm, + legend entries={Python, Java}, + legend pos=north west, + legend cell align=left] +\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data}; +\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data}; +\end{axis} +\end{tikzpicture} +& \begin{tikzpicture} \begin{axis}[ title={Graph: $\texttt{a?\{n\}\,a{\{n\}}}$ and strings @@ -214,47 +240,24 @@ \addplot[brown,mark=triangle*, mark options={fill=white}] table {re-ruby.data}; \end{axis} \end{tikzpicture} -& -\begin{tikzpicture} -\begin{axis}[ - title={Graph: $\texttt{(a*)*\,b}$ and strings - $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$}, - xlabel={$n$}, - x label style={at={(1.05,0.0)}}, - ylabel={time in secs}, - enlargelimits=false, - xtick={0,5,...,30}, - xmax=33, - ymax=35, - ytick={0,5,...,30}, - scaled ticks=false, - axis lines=left, - width=5.5cm, - height=4.5cm, - legend entries={Python, Java}, - legend pos=north west, - legend cell align=left] -\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data}; -\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data}; -\end{axis} -\end{tikzpicture} \end{tabular} \end{center} -\noindent This first graph shows that Python needs approximately 29 -seconds for finding out whether a string of 28 \texttt{a}s matches the -regular expression \texttt{a?\{28\}\,a\{28\}}. Ruby is even slightly +\noindent This first graph shows that Python and Java need +approximately 30 seconds to find out that the regular expression +$\texttt{(a*)*\,b}$ does not match strings of 28 \texttt{a}s. +Similarly, the second shows that Python needs approximately 29 seconds +for finding out whether a string of 28 \texttt{a}s matches the regular +expression \texttt{a?\{28\}\,a\{28\}}. Ruby is even slightly worse.\footnote{In this example Ruby uses the slightly different regular expression \texttt{a?a?a?...a?a?aaa...aa}, where the \texttt{a?} and \texttt{a} each occur $n$ times. More such test cases can be found at \url{http://www.computerbytesman.com/redos/}.} -Simlarly, Python and Java needs approximately 30 seconds to find out -that the regular expression $\texttt{(a*)*\,b}$ does not match strings -of 28 \texttt{a}s. Admittedly, these regular expressions are -carefully chosen to exhibit this exponential behaviour, but similar -ones occur more often than one wants in ``real life''. For example, on -20 July 2016 a similar regular expression brought the webpage -\href{http://stackexchange.com}{Stack Exchange} to its knees: +Admittedly, these regular expressions are carefully chosen to exhibit +this exponential behaviour, but similar ones occur more often than one +wants in ``real life''. For example, on 20 July 2016 a similar regular +expression brought the webpage \href{http://stackexchange.com}{Stack + Exchange} to its knees: \begin{center} \url{http://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016} @@ -293,9 +296,9 @@ at a relatively simple algorithm that solves this problem much better than Python and Ruby do\ldots actually it will be two versions of the algorithm: the first one will be able to -process strings of approximately 1,000 \texttt{a}s in 30 +process strings of approximately 1,100 \texttt{a}s in 23 seconds, while the second version will even be able to process -up to 7,000(!) in 30 seconds, see the graph below: +up to 11,000(!) in 5 seconds, see the graph below: \begin{center} \begin{tikzpicture} @@ -306,8 +309,8 @@ x label style={at={(1.05,0.0)}}, ylabel={time in secs}, enlargelimits=false, - xtick={0,3000,...,9000}, - xmax=10000, + xtick={0,3000,...,12000}, + xmax=13000, ymax=32, ytick={0,5,...,30}, scaled ticks=false, @@ -696,8 +699,9 @@ \end{quote} -\begin{figure}[p] -\lstinputlisting{../progs/crawler1.scala} +\begin{figure}[p]\small + \lstinputlisting[linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}] + {../progs/crawler1.scala} \caption{The Scala code for a simple web-crawler that checks for broken links in a web-page. It uses the regular expression @@ -708,8 +712,10 @@ \end{figure} -\begin{figure}[p] -\lstinputlisting{../progs/crawler2.scala} + +\begin{figure}[p]\small + \lstinputlisting[linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}] + {../progs/crawler2.scala} \caption{A version of the web-crawler that only follows links in ``my'' domain---since these are the ones I am interested in @@ -722,8 +728,9 @@ \end{figure} -\begin{figure}[p] -\lstinputlisting{../progs/crawler3.scala} +\begin{figure}[p]\small + \lstinputlisting[linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}] + {../progs/crawler3.scala} \caption{A small email harvester---whenever we download a web-page, we also check whether it contains any email