afl-material: comparison handouts/ho02.tex

-:28e872e7efb3
+:dfa7b4ca199f
 \section*{Handout 2 (Regular Expression Matching)}
 This lecture is about implementing a more efficient regular expression
-matcher (the plots on the right)---more efficient than the matchers
+matcher (the plots on the right below)---more efficient than the
-from regular expression libraries in Ruby, Python and Java (the plots
+matchers from regular expression libraries in Ruby, Python and Java
-on the left). The first pair of plots show the running time for the
+(the plots on the left). The first pair of plots show the running time
-regular expressions $a^?{}^{\{n\}}\cdot a^{\{n\}}$ and strings composed
+for the regular expression $(a^*)^*\cdot b$ and strings composed of
-of $n$ \pcode{a}s. The second pair of plots show the running time
+$n$ \pcode{a}s (meaning this regular expression actually does not
-for the regular expression $(a^*)^*\cdot b$ and also strings composed
+match the strings). The second pair of plots show the running time for
-of $n$ \pcode{a}s (meaning this regular expression actually does not
+the regular expressions $a^?{}^{\{n\}}\cdot a^{\{n\}}$ and strings
+also composed of $n$ \pcode{a}s (this time the regular expressions
 match the strings).  To see the substantial differences in the left
 and right plots below, note the different scales of the $x$-axes.
+\begin{center}
+Graphs: $(a^*)^* \cdot b$ and strings $\underbrace{a\ldots a}_{n}$
+\begin{tabular}{@{}cc@{}}
+\begin{tikzpicture}
+\begin{axis}[
+xlabel={$n$},
+x label style={at={(1.05,0.0)}},
+ylabel={time in secs},
+enlargelimits=false,
+xtick={0,5,...,30},
+xmax=33,
+ymax=35,
+ytick={0,5,...,30},
+scaled ticks=false,
+axis lines=left,
+width=5cm,
+height=5cm,
+legend entries={Java, Python},
+legend pos=north west,
+legend cell align=left]
+\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
+\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
+\end{axis}
+\end{tikzpicture}
+&
+\begin{tikzpicture}
+\begin{axis}[
+xlabel={$n$},
+x label style={at={(1.05,0.0)}},
+ylabel={time in secs},
+enlargelimits=false,
+ymax=35,
+ytick={0,5,...,30},
+axis lines=left,
+scaled ticks=false,
+width=6.5cm,
+height=5cm,
+legend entries={Scala V3},
+legend pos=north east,
+legend cell align=left]
+%\addplot[green,mark=square*,mark options={fill=white}] table {re2a.data};
+\addplot[black,mark=square*,mark options={fill=white}] table {re3a.data};
+\end{axis}
+\end{tikzpicture}
+\end{tabular}
+\end{center}
 \begin{center}
 Graphs: $a^{?\{n\}} \cdot a^{\{n\}}$ and strings $\underbrace{a\ldots a}_{n}$\\
 \begin{tabular}{@{}cc@{}}
 \begin{tikzpicture}
 ymax=35,
 ytick={0,5,...,30},
 scaled ticks=false,
 axis lines=left,
 width=6.5cm,
-height=5cm]
+height=5cm,
-\addplot[green,mark=square*,mark options={fill=white}] table {re2.data};
+legend entries={Scala V3},
+legend pos=north east,
+legend cell align=left]
+%\addplot[green,mark=square*,mark options={fill=white}] table {re2.data};
 \addplot[black,mark=square*,mark options={fill=white}] table {re3.data};
 \end{axis}
 \end{tikzpicture}
 \end{tabular}
 \end{center}
+\medskip
-\begin{center}
-Graphs: $(a^*)^* \cdot b$ and strings $\underbrace{a\ldots a}_{n}$
-\begin{tabular}{@{}cc@{}}
-\begin{tikzpicture}
-\begin{axis}[
-xlabel={$n$},
-x label style={at={(1.05,0.0)}},
-ylabel={time in secs},
-enlargelimits=false,
-xtick={0,5,...,30},
-xmax=33,
-ymax=35,
-ytick={0,5,...,30},
-scaled ticks=false,
-axis lines=left,
-width=5cm,
-height=5cm,
-legend entries={Java, Python},
-legend pos=north west,
-legend cell align=left]
-\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
-\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
-\end{axis}
-\end{tikzpicture}
-&
-\begin{tikzpicture}
-\begin{axis}[
-xlabel={$n$},
-x label style={at={(1.05,0.0)}},
-ylabel={time in secs},
-enlargelimits=false,
-ymax=35,
-ytick={0,5,...,30},
-axis lines=left,
-scaled ticks=false,
-width=6.5cm,
-height=5cm]
-\addplot[green,mark=square*,mark options={fill=white}] table {re2a.data};
-\addplot[black,mark=square*,mark options={fill=white}] table {re3a.data};
-\end{axis}
-\end{tikzpicture}
-\end{tabular}
-\end{center}\medskip
 \noindent
-We will use these regular expressions and strings
+We will use these regular expressions and strings as running
-as running examples.
+examples. There will be several versions (V1, V2, V3,\ldots) of the
+algorithm.\footnote{The corresponding files are \texttt{re1.scala},
+\texttt{re2.scala} and so on. As usual, you can find the code on
+KEATS.}\bigskip
+\noindent
 Having specified in the previous lecture what
 problem our regular expression matcher is supposed to solve,
 namely for any given regular expression $r$ and string $s$
 answer \textit{true} if and only if
 \begin{center}
 \code{case class NTIMES(r: Rexp, n: Int) extends Rexp}
 \end{center}
-\noindent With this fix we have a constant ``size'' regular
+\noindent With this fix we have a constant ``size'' regular expression
-expression for our running example no matter how large $n$ is.
+for our running example no matter how large $n$ is (see the
-This means we have to also add cases for \pcode{NTIMES} in the
+\texttt{size} section in the implementations).  This means we have to
-functions $\textit{nullable}$ and $\textit{der}$. Does the change have any
+also add cases for \pcode{NTIMES} in the functions $\textit{nullable}$
-effect?
+and $\textit{der}$. Does the change have any effect?
 \begin{center}
 \begin{tikzpicture}
 \begin{axis}[
 title={Graph: $a^{?\{n\}} \cdot a^{\{n\}}$ and strings $\underbrace{a\ldots a}_{n}$},
 \addplot[green,mark=square*,mark options={fill=white}] table {re2.data};
 \end{axis}
 \end{tikzpicture}
 \end{center}
-\noindent Now we are talking business! The modified matcher
+\noindent Now we are talking business! The modified matcher can within
-can within 25 seconds handle regular expressions up to
+25 seconds handle regular expressions up to $n = 1,100$ before a
-$n = 1,100$ before a StackOverflow is raised. Recall that Python and Ruby
+StackOverflow is raised. Recall that Python and Ruby (and our first
-(and our first version, Scala V1) could only handle $n = 27$ or so in 30
+version, Scala V1) could only handle $n = 27$ or so in 30
-seconds. There is no change for our second example
+seconds. There is no change for our second example $(a^*)^* \cdot
-$(a^*)^* \cdot b$---so this is still good.
+b$---so this is still good.
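 
 \noindent
 For reference, the two additional cases for \pcode{NTIMES} mentioned
 above can be sketched as follows. This is one common way to define
 them, given only as an illustration and assuming that $r^{\{n\}}$
 stands for exactly $n$ copies of $r$ in sequence:
 \begin{center}
 \begin{tabular}{lcl}
 $\textit{nullable}(r^{\{n\}})$ & $=$ & $\textit{if}\;n = 0\;\textit{then}\;\textit{true}\;\textit{else}\;\textit{nullable}(r)$\\
 $\textit{der}\,c\,(r^{\{n\}})$ & $=$ & $\textit{if}\;n = 0\;\textit{then}\;\ZERO\;\textit{else}\;(\textit{der}\,c\,r) \cdot r^{\{n-1\}}$
 \end{tabular}
 \end{center}
 
 \noindent
 Under these definitions a derivative step decrements the counter
 instead of unfolding $r^{\{n\}}$ into $n$ explicit copies of $r$.
 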
 The moral is that our algorithm is rather sensitive to the
 size of regular expressions it needs to handle. This is of
 course obvious because both $\textit{nullable}$ and $\textit{der}$ frequently
 need to traverse the whole regular expression. There seems,
 however, to be one more issue for making the algorithm run faster.
 The derivative function often produces ``useless''
 $\ZERO$s and $\ONE$s. To see this, consider $r = ((a
-\cdot b) + b)^*$ and the following two derivatives
+\cdot b) + b)^*$ and the following three derivatives
 \begin{center}
 \begin{tabular}{l}
 $\textit{der}\,a\,r = ((\ONE \cdot b) + \ZERO) \cdot r$\\
 $\textit{der}\,b\,r = ((\ZERO \cdot b) + \ONE)\cdot r$\\
 $\textit{der}\,c\,r \equiv \ZERO$
 \end{tabular}
 \end{center}
 \noindent I leave it to you to contemplate whether such a
-simplification can have any impact on the correctness of our
+simplification can have any impact on the correctness of our algorithm
-algorithm (will it change any answers?). Figure~\ref{scala2}
+(will it change any answers?). Figure~\ref{scala2} gives a
-gives a simplification function that recursively traverses a
+simplification function that recursively traverses a regular
-regular expression and simplifies it according to the rules
+expression and simplifies it according to the rules given at the
-given at the beginning. There are only rules for $+$, $\cdot$
+beginning. There are only rules for $+$, $\cdot$ and $n$-times (the
-and $n$-times (the latter because we added it in the second
+latter because we added it in the second version of our
-version of our matcher). There is no rule for a star, because
+matcher). There is no simplification rule for a star, because
-empirical data and also a little thought showed that
+empirical data and also a little thought showed that simplifying under
-simplifying under a star is a waste of computation time. The
+a star is a waste of computation time. The simplification function
-simplification function will be called after every derivation.
+will be called after every derivation.  This additional step removes
-This additional step removes all the ``junk'' the derivative
+all the ``junk'' the derivative function introduced. Does this improve
-function introduced. Does this improve the speed? You bet!!
+the speed? You bet!!
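 
 \noindent
 To make the effect concrete, here is how the first two derivatives
 from above simplify. This is a worked sketch that assumes the usual
 rules $\ONE \cdot r \mapsto r$, $\ZERO \cdot r \mapsto \ZERO$,
 $r + \ZERO \mapsto r$ and $\ZERO + r \mapsto r$; the regular
 expression $r$ itself is left untouched because it is a star.
 \begin{center}
 \begin{tabular}{lcl}
 $((\ONE \cdot b) + \ZERO) \cdot r$ & $\mapsto$ & $(b + \ZERO) \cdot r \;\mapsto\; b \cdot r$\\
 $((\ZERO \cdot b) + \ONE) \cdot r$ & $\mapsto$ & $(\ZERO + \ONE) \cdot r \;\mapsto\; \ONE \cdot r \;\mapsto\; r$
 \end{tabular}
 \end{center}
 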
 \begin{figure}[p]
 \lstinputlisting[numbers=left,linebackgroundcolor=
 {\ifodd\value{lstnumber}\color{capri!3}\fi}]
 {../progs/app6.scala}
 title={Graph: $a^{?\{n\}} \cdot a^{\{n\}}$ and strings $\underbrace{a\ldots a}_{n}$},
 xlabel={$n$},
 x label style={at={(1.04,0.0)}},
 ylabel={time in secs},
 enlargelimits=false,
-xtick={0,3000,...,9000},
+xtick={0,2500,...,10000},
-xmax=10000,
+xmax=12000,
 ytick={0,5,...,30},
 ymax=32,
 scaled ticks=false,
 axis lines=left,
 width=9cm,
 \end{axis}
 \end{tikzpicture}
 \end{center}
 \noindent
-To recap, Python and Ruby needed approximately 30 seconds to match
+To recap, Python and Ruby needed approximately 30 seconds to match a
-a string of 28 \texttt{a}s and the regular expression $a^{?\{n\}} \cdot a^{\{n\}}$.
+string of 28 \texttt{a}s and the regular expression $a^{?\{n\}} \cdot
-We need a third of this time to do the same with strings up to 12,000 \texttt{a}s.
+a^{\{n\}}$.  We need a third of this time to do the same with strings
-Similarly, Java needed 30 seconds to find out the regular expression
+up to 11,000 \texttt{a}s.  Similarly, Java and Python needed 30
-$(a^*)^* \cdot b$ does not match the string of 28 \texttt{a}s. We can do
+seconds to find out the regular expression $(a^*)^* \cdot b$ does not
-the same in approximately 5 seconds for strings of 6000000 \texttt{a}s:
+match the string of 28 \texttt{a}s. We can do the same in
+approximately 5 seconds for strings of 6,000,000 \texttt{a}s:
 \begin{center}
 \begin{tikzpicture}
 \begin{axis}[
 title={Graph: $(a^*)^* \cdot b$ and strings $\underbrace{a\ldots a}_{n}$},
 xlabel={$n$},
-x label style={at={(1.09,0.0)}},
 ylabel={time in secs},
 enlargelimits=false,
-xmax=7700000,
+ymax=35,
 ytick={0,5,...,30},
-ymax=32,
-%scaled ticks=false,
 axis lines=left,
+scaled ticks=false,
+x label style={at={(1.09,0.0)}},
+%xmax=7700000,
 width=9cm,
 height=5cm,
-legend entries={Scala V2, Scala V3},
+legend entries={Scala V3},
 legend pos=outer north east,
 legend cell align=left]
-\addplot[green,mark=square*,mark options={fill=white}] table {re2a.data};
+%\addplot[green,mark=square*,mark options={fill=white}] table {re2a.data};
 \addplot[black,mark=square*,mark options={fill=white}] table {re3a.data};
 \end{axis}
 \end{tikzpicture}
 \end{center}
 \subsection*{Epilogue}
 (23/Aug/2016) I recently found another place where this algorithm can be
-sped (this idea is not integrated with what is coming next,
+sped up (this idea is not integrated with what is coming next,
 but I present it nonetheless). The idea is to define \texttt{ders}
 not such that it iterates the derivative character-by-character, but
 in bigger chunks. The resulting code for \texttt{ders2} looks as
 follows:

changeset 478	dfa7b4ca199f
parent 477	28e872e7efb3
child 479	4fcaa5a2d199
child 480	14318f1d3b0f