afl-material: comparison handouts/ho02.tex


-:24531cfaa36a
+:ee4304bc6350
 \usepackage{../style}
 \usepackage{../langs}
 \usepackage{../graphics}
 \usepackage{../data}
 \begin{document}
 \section*{Handout 2}
 This lecture is about implementing a more efficient regular
 expression matcher (the plots on the right)---more efficient
 than the matchers from regular expression libraries in Ruby and
 Python (the plots on the left). These plots show the running
 time for the evil regular expression $a?^{\{n\}}a^{\{n\}}$.
+Note that the two plots use different scales on the $x$-axis.
+\pgfplotsset{compat=1.11}
 \begin{center}
 \begin{tabular}{@{}cc@{}}
-\begin{tikzpicture}[y=.072cm, x=.12cm]
+\begin{tikzpicture}
-	%axis
+\begin{axis}[
-	\draw (0,0) -- coordinate (x axis mid) (30,0);
+xlabel={\pcode{a}s},
-\draw (0,0) -- coordinate (y axis mid) (0,30);
+ylabel={time in secs},
-%ticks
+enlargelimits=false,
-\foreach \x in {0,5,...,30}
+xtick={0,5,...,30},
-\draw (\x,1pt) -- (\x,-3pt) node[anchor=north] {\x};
+xmax=30,
-\foreach \y in {0,5,...,30}
+ymax=35,
-\draw (1pt,\y) -- (-3pt,\y) node[anchor=east] {\y};
+ytick={0,5,...,30},
-	%labels
+scaled ticks=false,
-	\node[below=0.6cm] at (x axis mid) {number of \texttt{a}s};
+axis lines=left,
-	\node[rotate=90,left=0.9cm] at (y axis mid) {time in secs};
+width=5cm,
-	%plots
+height=5cm,
-	\draw[color=blue] plot[mark=*]
+legend entries={Python,Ruby},
-		file {re-python.data};
+legend pos=north west,
-	\draw[color=brown] plot[mark=triangle*]
+legend cell align=left
-		file {re-ruby.data};
+]
-%legend
+\addplot[blue,mark=*, mark options={fill=white}]
-	\begin{scope}[shift={(4,20)}]
+table {re-python.data};
-	\draw[color=blue] (0,0) --
+\addplot[brown,mark=pentagon*, mark options={fill=white}]
-		plot[mark=*] (0.25,0) -- (0.5,0)
+table {re-ruby.data};
-		node[right]{\small Python};
+\end{axis}
-	\draw[yshift=-4mm, color=brown] (0,0) --
+\end{tikzpicture}
-		plot[mark=triangle*] (0.25,0) -- (0.5,0)
-		node[right]{\small Ruby};
-	\end{scope}
-\end{tikzpicture}
 &
-\begin{tikzpicture}[y=.072cm, x=.0004cm]
+\begin{tikzpicture}
-	%axis
+\begin{axis}[
-	\draw (0,0) -- coordinate (x axis mid) (12000,0);
+xlabel={\pcode{a}s},
-\draw (0,0) -- coordinate (y axis mid) (0,30);
+ylabel={time in secs},
-%ticks
+enlargelimits=false,
-\foreach \x in {0,3000,...,12000}
+xtick={0,3000,...,12000},
-	\draw (\x,1pt) -- (\x,-3pt) node[anchor=north] {\x};
+xmax=12000,
-\foreach \y in {0,5,...,30}
+ymax=35,
-	\draw (1pt,\y) -- (-3pt,\y) node[anchor=east] {\y};
+ytick={0,5,...,30},
-	%labels
+scaled ticks=false,
-	\node[below=0.6cm] at (x axis mid) {number of \texttt{a}s};
+axis lines=left,
-	\node[rotate=90,left=0.9cm] at (y axis mid) {time in secs};
+width=6.5cm,
+height=5cm
-	%plots
+]
-\draw[color=green] plot[mark=square*, mark options={fill=white} ]
+\addplot[green,mark=square*,mark options={fill=white}] table {re2b.data};
-		file {re2b.data};
+\addplot[black,mark=square*,mark options={fill=white}] table {re3.data};
-	\draw[color=black] plot[mark=square*, mark options={fill=white} ]
+\end{axis}
-		file {re3.data};
 \end{tikzpicture}
 \end{tabular}
 \end{center}\medskip
 \lstinputlisting[numbers=none]{../progs/app51.scala}
 \noindent Running the matcher with the example, we find it is
 slightly worse than the matchers in Ruby and Python.
-Ooops\ldots\medskip
+Ooops\ldots
+\pgfplotsset{compat=1.11}
+\begin{center}
+\begin{tikzpicture}
+\begin{axis}[
+xlabel={\pcode{a}s},
+ylabel={time in secs},
+enlargelimits=false,
+xtick={0,5,...,30},
+xmax=30,
+ytick={0,5,...,30},
+scaled ticks=false,
+axis lines=left,
+width=6cm,
+height=5cm,
+legend entries={Python,Ruby,Scala V1},
+legend pos=outer north east,
+legend cell align=left
+]
+\addplot[blue,mark=*, mark options={fill=white}]
+table {re-python.data};
+\addplot[brown,mark=pentagon*, mark options={fill=white}]
+table {re-ruby.data};
+\addplot[red,mark=triangle*,mark options={fill=white}]
+table {re1.data};
+\end{axis}
+\end{tikzpicture}
+\end{center}
 \noindent Analysing this failure a bit we notice that
 for $a^{\{n\}}$ we generate quite big regular expressions:
 \begin{center}
 \begin{tabular}{rl}
 1: & $a$\\
 2: & $a\cdot a$\\
 3: & $a\cdot a\cdot a$\\
 & \ldots\\
 13: & $a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a$\\
-& \ldots\\
+& \ldots
-20:
 \end{tabular}
 \end{center}
 \noindent Our algorithm traverses such regular expressions at
 least once every time a derivative is calculated. So having
-large regular expressions, will cause problems. This problem
+large regular expressions will cause problems. This problem
-is aggravated with $a?$ being represented as $a + \epsilon$.
+is aggravated by $a?$ being represented as $a + \epsilon$.
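+To put a rough number on this, assume (as an illustration only)
+that $size$ counts every constructor in a regular expression
+and that $r^{\{n\}}$ is unfolded into $n$ copies of $r$ joined
+by $n - 1$ sequences. Then
+\begin{center}
+\begin{tabular}{lcl}
+$size(a^{\{n\}})$ & $=$ & $n + (n - 1) = 2n - 1$\\
+$size((a + \epsilon)^{\{n\}})$ & $=$ & $3n + (n - 1) = 4n - 1$\\
+$size(a?^{\{n\}} \cdot a^{\{n\}})$ & $=$ & $(4n - 1) + (2n - 1) + 1 = 6n - 1$
+\end{tabular}
+\end{center}
+\noindent Under these assumptions the matcher already starts
+with a regular expression of size $77$ for $n = 13$.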
+We can fix this by having an explicit constructor for
+$r^{\{n\}}$. In Scala we would introduce a constructor like
+\begin{center}
+\code{case class NTIMES(r: Rexp, n: Int) extends Rexp}
+\end{center}
+\noindent With this we have a constant ``size'' regular
+expression for our running example no matter how large $n$
+is. This means we have to also add cases for $nullable$ and
+$der$. Does the change have any effect?
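+
+\noindent Before looking at the timings, here is one way the
+two new cases could be defined. This is only a sketch; the
+definitions in the accompanying Scala files may be phrased
+differently.
+\begin{center}
+\begin{tabular}{lcl}
+$nullable(r^{\{n\}})$ & $=$ & if $n = 0$ then $true$ else $nullable(r)$\\
+$der\,c\,(r^{\{n\}})$ & $=$ & if $n = 0$ then $\varnothing$ else $(der\,c\,r) \cdot r^{\{n - 1\}}$
+\end{tabular}
+\end{center}
+\noindent The $der$ case treats $r^{\{n\}}$ like the sequence
+$r \cdot r^{\{n - 1\}}$. Neither case unfolds the $n$ copies
+of $r$.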
+\pgfplotsset{compat=1.11}
+\begin{center}
+\begin{tikzpicture}
+\begin{axis}[
+xlabel={\pcode{a}s},
+ylabel={time in secs},
+enlargelimits=false,
+xtick={0,100,...,1000},
+xmax=1000,
+ytick={0,5,...,30},
+scaled ticks=false,
+axis lines=left,
+width=10cm,
+height=5cm,
+legend entries={Python,Ruby,Scala V1,Scala V2},
+legend pos=outer north east,
+legend cell align=left
+]
+\addplot[blue,mark=*, mark options={fill=white}]
+table {re-python.data};
+\addplot[brown,mark=pentagon*, mark options={fill=white}]
+table {re-ruby.data};
+\addplot[red,mark=triangle*,mark options={fill=white}]
+table {re1.data};
+\addplot[green,mark=square*,mark options={fill=white}]
+table {re2b.data};
+\end{axis}
+\end{tikzpicture}
+\end{center}
+\noindent Now we are talking business! The modified matcher
+can handle regular expressions up to $n = 950$ within 30
+seconds, before a StackOverflow is raised.
+The moral is that our algorithm is rather sensitive to the
+size of regular expressions it needs to handle. This is of
+course obvious because both $nullable$ and $der$ need to
+traverse the whole regular expression. There seems to be one
+more opportunity for making the algorithm run faster. The derivative
+function often produces ``useless'' $\varnothing$s and
+$\epsilon$s. To see this, consider $r = ((a \cdot b) + b)^*$
+and the following three derivatives
+\begin{center}
+\begin{tabular}{l}
+$der\,a\,r = ((\epsilon \cdot b) + \varnothing) \cdot r$\\
+$der\,b\,r = ((\varnothing \cdot b) + \epsilon)\cdot r$\\
+$der\,c\,r = ((\varnothing \cdot b) + \varnothing)\cdot r$
+\end{tabular}
+\end{center}
+\noindent
+If we simplify them according to the simple rules from the
+beginning, we can replace the right-hand sides by the
+smaller equivalent regular expressions
+\begin{center}
+\begin{tabular}{l}
+$der\,a\,r \equiv b \cdot r$\\
+$der\,b\,r \equiv r$\\
+$der\,c\,r \equiv \varnothing$
+\end{tabular}
+\end{center}
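+\noindent Spelled out step by step, and assuming the rules
+meant here include $\epsilon \cdot r \equiv r$,
+$\varnothing \cdot r \equiv \varnothing$,
+$r + \varnothing \equiv r$ and $\varnothing + r \equiv r$,
+the three simplifications go as follows:
+\begin{center}
+\begin{tabular}{lcl}
+$((\epsilon \cdot b) + \varnothing) \cdot r$ & $\equiv$ & $(b + \varnothing) \cdot r \;\equiv\; b \cdot r$\\
+$((\varnothing \cdot b) + \epsilon) \cdot r$ & $\equiv$ & $(\varnothing + \epsilon) \cdot r \;\equiv\; \epsilon \cdot r \;\equiv\; r$\\
+$((\varnothing \cdot b) + \varnothing) \cdot r$ & $\equiv$ & $(\varnothing + \varnothing) \cdot r \;\equiv\; \varnothing \cdot r \;\equiv\; \varnothing$
+\end{tabular}
+\end{center}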
+\noindent I leave it to you to contemplate whether such a
+simplification can have any impact on the correctness of our
+algorithm (will it change any answers?). Figure~\ref{scala2}
+gives a simplification function that recursively traverses a
+regular expression and simplifies it according to the rules
+given at the beginning. There are only rules for $+$, $\cdot$
+and $n$-times (the latter because we added it in the second
+version of our matcher). There is no rule for a star, because
+empirical data and also a little thought showed that
+simplifying under a star is a waste of computation time. The
+simplification function will be called after every derivative step.
+This additional step removes all the ``junk'' the derivative
+function introduced. Does this improve the speed? You bet!!
+\begin{figure}[p]
+\lstinputlisting{../progs/app6.scala}
+\caption{The simplification function and modified
+\pcode{ders}-function.\label{scala2}}
+\end{figure}
+\pgfplotsset{compat=1.11}
+\begin{center}
+\begin{tikzpicture}
+\begin{axis}[
+xlabel={\pcode{a}s},
+ylabel={time in secs},
+enlargelimits=false,
+xtick={0,2000,...,12000},
+xmax=12000,
+ytick={0,5,...,30},
+scaled ticks=false,
+axis lines=left,
+width=9cm,
+height=4cm,
+legend entries={Scala V2,Scala V3},
+]
+\addplot[green,mark=square*,mark options={fill=white}] table {re2b.data};
+\addplot[black,mark=square*,mark options={fill=white}] table {re3.data};
+\end{axis}
+\end{tikzpicture}
+\end{center}
 \end{document}
 %%% Local Variables:
 %%% mode: latex
 %%% TeX-master: t
 %%% End:

changeset 262	ee4304bc6350
parent 261	24531cfaa36a
child 263	92e6985018ae