handouts/ho02.tex
changeset 478 48b842c997c7
parent 477 b78664a24f5d
child 479 52aa298211f6
child 480 9e42ccbbd1e6
--- a/handouts/ho02.tex	Wed Mar 15 01:24:39 2017 +0000
+++ b/handouts/ho02.tex	Wed Mar 15 14:34:10 2017 +0000
@@ -13,16 +13,64 @@
 \section*{Handout 2 (Regular Expression Matching)}
 
 This lecture is about implementing a more efficient regular expression
-matcher (the plots on the right)---more efficient than the matchers
-from regular expression libraries in Ruby, Python and Java (the plots
-on the left). The first pair of plots show the running time for the
-regular expressions $a^?{}^{\{n\}}\cdot a^{\{n\}}$ and strings composed
-of $n$ \pcode{a}s. The second pair of plots show the running time
-for the regular expression $(a^*)^*\cdot b$ and also strings composed
-of $n$ \pcode{a}s (meaning this regular expression actually does not
+matcher (the plots on the right below)---more efficient than the
+matchers from regular expression libraries in Ruby, Python and Java
+(the plots on the left). The first pair of plots show the running time
+for the regular expression $(a^*)^*\cdot b$ and strings composed of
+$n$ \pcode{a}s (meaning this regular expression actually does not
+match the strings). The second pair of plots show the running time for
+the regular expressions $a^?{}^{\{n\}}\cdot a^{\{n\}}$ and strings
+also composed of $n$ \pcode{a}s (this time the regular expressions
 match the strings).  To see the substantial differences in the left
-and right plots below, note the different scales of the $x$-axes. 
+and right plots below, note the different scales of the $x$-axes.
+
 
+\begin{center}
+Graphs: $(a^*)^* \cdot b$ and strings $\underbrace{a\ldots a}_{n}$
+\begin{tabular}{@{}cc@{}}
+\begin{tikzpicture}
+\begin{axis}[
+    xlabel={$n$},
+    x label style={at={(1.05,0.0)}},
+    ylabel={time in secs},
+    enlargelimits=false,
+    xtick={0,5,...,30},
+    xmax=33,
+    ymax=35,
+    ytick={0,5,...,30},
+    scaled ticks=false,
+    axis lines=left,
+    width=5cm,
+    height=5cm, 
+    legend entries={Java, Python},  
+    legend pos=north west,
+    legend cell align=left]
+\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
+\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
+\end{axis}
+\end{tikzpicture}
+&
+\begin{tikzpicture}
+  \begin{axis}[
+    xlabel={$n$},
+    x label style={at={(1.05,0.0)}},
+    ylabel={time in secs},
+    enlargelimits=false,
+    ymax=35,
+    ytick={0,5,...,30},
+    axis lines=left,
+    scaled ticks=false,
+    width=6.5cm,
+    height=5cm,
+    legend entries={Scala V3},  
+    legend pos=north east,
+    legend cell align=left]
+%\addplot[green,mark=square*,mark options={fill=white}] table {re2a.data};    
+\addplot[black,mark=square*,mark options={fill=white}] table {re3a.data};
+\end{axis}
+\end{tikzpicture}
+\end{tabular}
+\end{center}
 
 \begin{center}
 Graphs: $a^{?\{n\}} \cdot a^{\{n\}}$ and strings $\underbrace{a\ldots a}_{n}$\\
@@ -62,62 +110,26 @@
     scaled ticks=false,
     axis lines=left,
     width=6.5cm,
-    height=5cm]
-\addplot[green,mark=square*,mark options={fill=white}] table {re2.data};
+    height=5cm,
+    legend entries={Scala V3},  
+    legend pos=north east,
+    legend cell align=left]
+%\addplot[green,mark=square*,mark options={fill=white}] table {re2.data};
 \addplot[black,mark=square*,mark options={fill=white}] table {re3.data};
 \end{axis}
 \end{tikzpicture}
 \end{tabular}
 \end{center}
-
-\begin{center}
-Graphs: $(a^*)^* \cdot b$ and strings $\underbrace{a\ldots a}_{n}$
-\begin{tabular}{@{}cc@{}}
-\begin{tikzpicture}
-\begin{axis}[
-    xlabel={$n$},
-    x label style={at={(1.05,0.0)}},
-    ylabel={time in secs},
-    enlargelimits=false,
-    xtick={0,5,...,30},
-    xmax=33,
-    ymax=35,
-    ytick={0,5,...,30},
-    scaled ticks=false,
-    axis lines=left,
-    width=5cm,
-    height=5cm, 
-    legend entries={Java, Python},  
-    legend pos=north west,
-    legend cell align=left]
-\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
-\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
-\end{axis}
-\end{tikzpicture}
-&
-\begin{tikzpicture}
-  \begin{axis}[
-    xlabel={$n$},
-    x label style={at={(1.05,0.0)}},
-    ylabel={time in secs},
-    enlargelimits=false,
-    ymax=35,
-    ytick={0,5,...,30},
-    axis lines=left,
-    scaled ticks=false,
-    width=6.5cm,
-    height=5cm]
-\addplot[green,mark=square*,mark options={fill=white}] table {re2a.data};    
-\addplot[black,mark=square*,mark options={fill=white}] table {re3a.data};
-\end{axis}
-\end{tikzpicture}
-\end{tabular}
-\end{center}\medskip
+\medskip
 
 \noindent
-We will use these regular expressions and strings
-as running examples.
+We will use these regular expressions and strings as running
+examples. There will be several versions (V1, V2, V3,\ldots) of the
+algorithm.\footnote{The corresponding files are \texttt{re1.scala},
+  \texttt{re2.scala} and so on. As usual, you can find the code on
+  KEATS.}\bigskip
 
+\noindent
 Having specified in the previous lecture what
 problem our regular expression matcher is supposed to solve,
 namely for any given regular expression $r$ and string $s$
@@ -571,11 +583,11 @@
 \code{case class NTIMES(r: Rexp, n: Int) extends Rexp}
 \end{center}
 
-\noindent With this fix we have a constant ``size'' regular
-expression for our running example no matter how large $n$ is.
-This means we have to also add cases for \pcode{NTIMES} in the
-functions $\textit{nullable}$ and $\textit{der}$. Does the change have any
-effect?
+\noindent With this fix we have a constant ``size'' regular expression
+for our running example no matter how large $n$ is (see the
+\texttt{size} section in the implementations).  This means we have to
+also add cases for \pcode{NTIMES} in the functions $\textit{nullable}$
+and $\textit{der}$. Does the change have any effect?
 
 \begin{center}
 \begin{tikzpicture}
@@ -603,12 +615,12 @@
 \end{tikzpicture}
 \end{center}
 
-\noindent Now we are talking business! The modified matcher 
-can within 25 seconds handle regular expressions up to
-$n = 1,100$ before a StackOverflow is raised. Recall that Python and Ruby 
-(and our first version, Scala V1) could only handle $n = 27$ or so in 30 
-seconds. There is no change for our second example 
-$(a^*)^* \cdot b$---so this is still good.
+\noindent Now we are talking business! The modified matcher can within
+25 seconds handle regular expressions up to $n = 1,100$ before a
+StackOverflow is raised. Recall that Python and Ruby (and our first
+version, Scala V1) could only handle $n = 27$ or so in 30
+seconds. There is no change for our second example $(a^*)^* \cdot
+b$---so this is still good.
 
 
 The moral is that our algorithm is rather sensitive to the
@@ -618,7 +630,7 @@
 however, one more issue for making the algorithm run faster.
 The derivative function often produces ``useless''
 $\ZERO$s and $\ONE$s. To see this, consider $r = ((a
-\cdot b) + b)^*$ and the following two derivatives
+\cdot b) + b)^*$ and the following three derivatives
 
 \begin{center}
 \begin{tabular}{l}
@@ -642,18 +654,18 @@
 \end{center}
 
 \noindent I leave it to you to contemplate whether such a
-simplification can have any impact on the correctness of our
-algorithm (will it change any answers?). Figure~\ref{scala2}
-gives a simplification function that recursively traverses a
-regular expression and simplifies it according to the rules
-given at the beginning. There are only rules for $+$, $\cdot$
-and $n$-times (the latter because we added it in the second
-version of our matcher). There is no rule for a star, because
-empirical data and also a little thought showed that
-simplifying under a star is a waste of computation time. The
-simplification function will be called after every derivation.
-This additional step removes all the ``junk'' the derivative
-function introduced. Does this improve the speed? You bet!!
+simplification can have any impact on the correctness of our algorithm
+(will it change any answers?). Figure~\ref{scala2} gives a
+simplification function that recursively traverses a regular
+expression and simplifies it according to the rules given at the
+beginning. There are only rules for $+$, $\cdot$ and $n$-times (the
+latter because we added it in the second version of our
+matcher). There is no simplification rule for a star, because
+empirical data and also a little thought showed that simplifying under
+a star is a waste of computation time. The simplification function
+will be called after every derivation.  This additional step removes
+all the ``junk'' the derivative function introduced. Does this improve
+the speed? You bet!!
 
 \begin{figure}[p]
 \lstinputlisting[numbers=left,linebackgroundcolor=
@@ -675,8 +687,8 @@
     x label style={at={(1.04,0.0)}},
     ylabel={time in secs},
     enlargelimits=false,
-    xtick={0,3000,...,9000},
-    xmax=10000,
+    xtick={0,2500,...,10000},
+    xmax=12000,
     ytick={0,5,...,30},
     ymax=32,
     scaled ticks=false,
@@ -693,12 +705,13 @@
 \end{center}
 
 \noindent
-To reacap, Python and Ruby needed approximately 30 seconds to match
-a string of 28 \texttt{a}s and the regular expression $a^{?\{n\}} \cdot a^{\{n\}}$. 
-We need a third of this time to do the same with strings up to 12,000 \texttt{a}s. 
-Similarly, Java needed 30 seconds to find out the regular expression 
-$(a^*)^* \cdot b$ does not match the string of 28 \texttt{a}s. We can do
-the same in approximately 5 seconds for strings of 6000000 \texttt{a}s:
+To reacap, Python and Ruby needed approximately 30 seconds to match a
+string of 28 \texttt{a}s and the regular expression $a^{?\{n\}} \cdot
+a^{\{n\}}$.  We need a third of this time to do the same with strings
+up to 11,000 \texttt{a}s.  Similarly, Java and Python needed 30
+seconds to find out the regular expression $(a^*)^* \cdot b$ does not
+match the string of 28 \texttt{a}s. We can do the same in
+for strings of 6,000,000 \texttt{a}s:
 
 
 \begin{center}
@@ -706,20 +719,20 @@
 \begin{axis}[
     title={Graph: $(a^*)^* \cdot b$ and strings $\underbrace{a\ldots a}_{n}$},
     xlabel={$n$},
-    x label style={at={(1.09,0.0)}},
     ylabel={time in secs},
     enlargelimits=false,
-    xmax=7700000,
+    ymax=35,
     ytick={0,5,...,30},
-    ymax=32,
-    %scaled ticks=false,
     axis lines=left,
+    scaled ticks=false,
+    x label style={at={(1.09,0.0)}},
+    %xmax=7700000,
     width=9cm,
     height=5cm, 
-    legend entries={Scala V2, Scala V3},
+    legend entries={Scala V3},
     legend pos=outer north east,
     legend cell align=left]
-\addplot[green,mark=square*,mark options={fill=white}] table {re2a.data};
+%\addplot[green,mark=square*,mark options={fill=white}] table {re2a.data};
 \addplot[black,mark=square*,mark options={fill=white}] table {re3a.data};
 \end{axis}
 \end{tikzpicture}
@@ -728,7 +741,7 @@
 \subsection*{Epilogue}
 
 (23/Aug/2016) I recently found another place where this algorithm can be 
-sped (this idea is not integrated with what is coming next,
+sped up (this idea is not integrated with what is coming next,
 but I present it nonetheless). The idea is to define \texttt{ders}
 not such that it iterates the derivative character-by-character, but
 in bigger chunks. The resulting code for \texttt{ders2} looks as