--- a/handouts/ho02.tex Wed Mar 15 01:24:39 2017 +0000
+++ b/handouts/ho02.tex Wed Mar 15 14:34:10 2017 +0000
@@ -13,16 +13,64 @@
\section*{Handout 2 (Regular Expression Matching)}
This lecture is about implementing a more efficient regular expression
-matcher (the plots on the right)---more efficient than the matchers
-from regular expression libraries in Ruby, Python and Java (the plots
-on the left). The first pair of plots show the running time for the
-regular expressions $a^?{}^{\{n\}}\cdot a^{\{n\}}$ and strings composed
-of $n$ \pcode{a}s. The second pair of plots show the running time
-for the regular expression $(a^*)^*\cdot b$ and also strings composed
-of $n$ \pcode{a}s (meaning this regular expression actually does not
+matcher (the plots on the right below)---more efficient than the
+matchers from regular expression libraries in Ruby, Python and Java
+(the plots on the left). The first pair of plots show the running time
+for the regular expression $(a^*)^*\cdot b$ and strings composed of
+$n$ \pcode{a}s (meaning this regular expression actually does not
+match the strings). The second pair of plots show the running time for
+the regular expressions $a^?{}^{\{n\}}\cdot a^{\{n\}}$ and strings
+also composed of $n$ \pcode{a}s (this time the regular expressions
match the strings). To see the substantial differences in the left
-and right plots below, note the different scales of the $x$-axes.
+and right plots below, note the different scales of the $x$-axes.
+
+\begin{center}
+Graphs: $(a^*)^* \cdot b$ and strings $\underbrace{a\ldots a}_{n}$
+\begin{tabular}{@{}cc@{}}
+\begin{tikzpicture}
+\begin{axis}[
+ xlabel={$n$},
+ x label style={at={(1.05,0.0)}},
+ ylabel={time in secs},
+ enlargelimits=false,
+ xtick={0,5,...,30},
+ xmax=33,
+ ymax=35,
+ ytick={0,5,...,30},
+ scaled ticks=false,
+ axis lines=left,
+ width=5cm,
+ height=5cm,
+ legend entries={Java, Python},
+ legend pos=north west,
+ legend cell align=left]
+\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
+\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
+\end{axis}
+\end{tikzpicture}
+&
+\begin{tikzpicture}
+ \begin{axis}[
+ xlabel={$n$},
+ x label style={at={(1.05,0.0)}},
+ ylabel={time in secs},
+ enlargelimits=false,
+ ymax=35,
+ ytick={0,5,...,30},
+ axis lines=left,
+ scaled ticks=false,
+ width=6.5cm,
+ height=5cm,
+ legend entries={Scala V3},
+ legend pos=north east,
+ legend cell align=left]
+%\addplot[green,mark=square*,mark options={fill=white}] table {re2a.data};
+\addplot[black,mark=square*,mark options={fill=white}] table {re3a.data};
+\end{axis}
+\end{tikzpicture}
+\end{tabular}
+\end{center}
\begin{center}
Graphs: $a^{?\{n\}} \cdot a^{\{n\}}$ and strings $\underbrace{a\ldots a}_{n}$\\
@@ -62,62 +110,26 @@
scaled ticks=false,
axis lines=left,
width=6.5cm,
- height=5cm]
-\addplot[green,mark=square*,mark options={fill=white}] table {re2.data};
+ height=5cm,
+ legend entries={Scala V3},
+ legend pos=north east,
+ legend cell align=left]
+%\addplot[green,mark=square*,mark options={fill=white}] table {re2.data};
\addplot[black,mark=square*,mark options={fill=white}] table {re3.data};
\end{axis}
\end{tikzpicture}
\end{tabular}
\end{center}
-
-\begin{center}
-Graphs: $(a^*)^* \cdot b$ and strings $\underbrace{a\ldots a}_{n}$
-\begin{tabular}{@{}cc@{}}
-\begin{tikzpicture}
-\begin{axis}[
- xlabel={$n$},
- x label style={at={(1.05,0.0)}},
- ylabel={time in secs},
- enlargelimits=false,
- xtick={0,5,...,30},
- xmax=33,
- ymax=35,
- ytick={0,5,...,30},
- scaled ticks=false,
- axis lines=left,
- width=5cm,
- height=5cm,
- legend entries={Java, Python},
- legend pos=north west,
- legend cell align=left]
-\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
-\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
-\end{axis}
-\end{tikzpicture}
-&
-\begin{tikzpicture}
- \begin{axis}[
- xlabel={$n$},
- x label style={at={(1.05,0.0)}},
- ylabel={time in secs},
- enlargelimits=false,
- ymax=35,
- ytick={0,5,...,30},
- axis lines=left,
- scaled ticks=false,
- width=6.5cm,
- height=5cm]
-\addplot[green,mark=square*,mark options={fill=white}] table {re2a.data};
-\addplot[black,mark=square*,mark options={fill=white}] table {re3a.data};
-\end{axis}
-\end{tikzpicture}
-\end{tabular}
-\end{center}\medskip
+\medskip
\noindent
-We will use these regular expressions and strings
-as running examples.
+We will use these regular expressions and strings as running
+examples. There will be several versions (V1, V2, V3,\ldots) of the
+algorithm.\footnote{The corresponding files are \texttt{re1.scala},
+ \texttt{re2.scala} and so on. As usual, you can find the code on
+ KEATS.}\bigskip
+\noindent
Having specified in the previous lecture what
problem our regular expression matcher is supposed to solve,
namely for any given regular expression $r$ and string $s$
@@ -571,11 +583,11 @@
\code{case class NTIMES(r: Rexp, n: Int) extends Rexp}
\end{center}
-\noindent With this fix we have a constant ``size'' regular
-expression for our running example no matter how large $n$ is.
-This means we have to also add cases for \pcode{NTIMES} in the
-functions $\textit{nullable}$ and $\textit{der}$. Does the change have any
-effect?
+\noindent With this fix we have a constant ``size'' regular expression
+for our running example no matter how large $n$ is (see the
+\texttt{size} section in the implementations). This means we have to
+also add cases for \pcode{NTIMES} in the functions $\textit{nullable}$
+and $\textit{der}$. Does the change have any effect?
\begin{center}
\begin{tikzpicture}
@@ -603,12 +615,12 @@
\end{tikzpicture}
\end{center}
-\noindent Now we are talking business! The modified matcher
-can within 25 seconds handle regular expressions up to
-$n = 1,100$ before a StackOverflow is raised. Recall that Python and Ruby
-(and our first version, Scala V1) could only handle $n = 27$ or so in 30
-seconds. There is no change for our second example
-$(a^*)^* \cdot b$---so this is still good.
+\noindent Now we are talking business! The modified matcher can within
+25 seconds handle regular expressions up to $n = 1,100$ before a
+StackOverflow is raised. Recall that Python and Ruby (and our first
+version, Scala V1) could only handle $n = 27$ or so in 30
+seconds. There is no change for our second example $(a^*)^* \cdot
+b$---so this is still good.
The moral is that our algorithm is rather sensitive to the
@@ -618,7 +630,7 @@
however, one more issue for making the algorithm run faster.
The derivative function often produces ``useless''
$\ZERO$s and $\ONE$s. To see this, consider $r = ((a
-\cdot b) + b)^*$ and the following two derivatives
+\cdot b) + b)^*$ and the following three derivatives
\begin{center}
\begin{tabular}{l}
@@ -642,18 +654,18 @@
\end{center}
\noindent I leave it to you to contemplate whether such a
-simplification can have any impact on the correctness of our
-algorithm (will it change any answers?). Figure~\ref{scala2}
-gives a simplification function that recursively traverses a
-regular expression and simplifies it according to the rules
-given at the beginning. There are only rules for $+$, $\cdot$
-and $n$-times (the latter because we added it in the second
-version of our matcher). There is no rule for a star, because
-empirical data and also a little thought showed that
-simplifying under a star is a waste of computation time. The
-simplification function will be called after every derivation.
-This additional step removes all the ``junk'' the derivative
-function introduced. Does this improve the speed? You bet!!
+simplification can have any impact on the correctness of our algorithm
+(will it change any answers?). Figure~\ref{scala2} gives a
+simplification function that recursively traverses a regular
+expression and simplifies it according to the rules given at the
+beginning. There are only rules for $+$, $\cdot$ and $n$-times (the
+latter because we added it in the second version of our
+matcher). There is no simplification rule for a star, because
+empirical data and also a little thought showed that simplifying under
+a star is a waste of computation time. The simplification function
+will be called after every derivation. This additional step removes
+all the ``junk'' the derivative function introduced. Does this improve
+the speed? You bet!!
\begin{figure}[p]
\lstinputlisting[numbers=left,linebackgroundcolor=
@@ -675,8 +687,8 @@
x label style={at={(1.04,0.0)}},
ylabel={time in secs},
enlargelimits=false,
- xtick={0,3000,...,9000},
- xmax=10000,
+ xtick={0,2500,...,10000},
+ xmax=12000,
ytick={0,5,...,30},
ymax=32,
scaled ticks=false,
@@ -693,12 +705,13 @@
\end{center}
\noindent
-To reacap, Python and Ruby needed approximately 30 seconds to match
-a string of 28 \texttt{a}s and the regular expression $a^{?\{n\}} \cdot a^{\{n\}}$.
-We need a third of this time to do the same with strings up to 12,000 \texttt{a}s.
-Similarly, Java needed 30 seconds to find out the regular expression
-$(a^*)^* \cdot b$ does not match the string of 28 \texttt{a}s. We can do
-the same in approximately 5 seconds for strings of 6000000 \texttt{a}s:
+To reacap, Python and Ruby needed approximately 30 seconds to match a
+string of 28 \texttt{a}s and the regular expression $a^{?\{n\}} \cdot
+a^{\{n\}}$. We need a third of this time to do the same with strings
+up to 11,000 \texttt{a}s. Similarly, Java and Python needed 30
+seconds to find out the regular expression $(a^*)^* \cdot b$ does not
+match the string of 28 \texttt{a}s. We can do the same in
+for strings of 6,000,000 \texttt{a}s:
\begin{center}
@@ -706,20 +719,20 @@
\begin{axis}[
title={Graph: $(a^*)^* \cdot b$ and strings $\underbrace{a\ldots a}_{n}$},
xlabel={$n$},
- x label style={at={(1.09,0.0)}},
ylabel={time in secs},
enlargelimits=false,
- xmax=7700000,
+ ymax=35,
ytick={0,5,...,30},
- ymax=32,
- %scaled ticks=false,
axis lines=left,
+ scaled ticks=false,
+ x label style={at={(1.09,0.0)}},
+ %xmax=7700000,
width=9cm,
height=5cm,
- legend entries={Scala V2, Scala V3},
+ legend entries={Scala V3},
legend pos=outer north east,
legend cell align=left]
-\addplot[green,mark=square*,mark options={fill=white}] table {re2a.data};
+%\addplot[green,mark=square*,mark options={fill=white}] table {re2a.data};
\addplot[black,mark=square*,mark options={fill=white}] table {re3a.data};
\end{axis}
\end{tikzpicture}
@@ -728,7 +741,7 @@
\subsection*{Epilogue}
(23/Aug/2016) I recently found another place where this algorithm can be
-sped (this idea is not integrated with what is coming next,
+sped up (this idea is not integrated with what is coming next,
but I present it nonetheless). The idea is to define \texttt{ders}
not such that it iterates the derivative character-by-character, but
in bigger chunks. The resulting code for \texttt{ders2} looks as