# HG changeset patch # User Christian Urban # Date 1489588450 0 # Node ID 48b842c997c7f0624207873f9530ff043776a5c6 # Parent b78664a24f5d1303e3643525305b7f5fe3e99ee9 updated diff -r b78664a24f5d -r 48b842c997c7 data.sty --- a/data.sty Wed Mar 15 01:24:39 2017 +0000 +++ b/data.sty Wed Mar 15 14:34:10 2017 +0000 @@ -198,6 +198,21 @@ %% re3.scala: example (a*)* b \begin{filecontents}{re3a.data} +1 0.00003 +500001 0.22527 +1000001 0.62752 +1500001 0.88485 +2000001 1.39815 +2500001 1.68619 +3000001 1.94957 +3500001 2.15878 +4000001 2.59918 +4500001 5.90679 +5000001 13.11295 +5500001 19.15376 +6000001 40.16373 +\end{filecontents} +\begin{filecontents}{re3b.data} 1 0.00015 500001 0.28337 1000001 0.53271 diff -r b78664a24f5d -r 48b842c997c7 handouts/ho02.pdf Binary file handouts/ho02.pdf has changed diff -r b78664a24f5d -r 48b842c997c7 handouts/ho02.tex --- a/handouts/ho02.tex Wed Mar 15 01:24:39 2017 +0000 +++ b/handouts/ho02.tex Wed Mar 15 14:34:10 2017 +0000 @@ -13,16 +13,64 @@ \section*{Handout 2 (Regular Expression Matching)} This lecture is about implementing a more efficient regular expression -matcher (the plots on the right)---more efficient than the matchers -from regular expression libraries in Ruby, Python and Java (the plots -on the left). The first pair of plots show the running time for the -regular expressions $a^?{}^{\{n\}}\cdot a^{\{n\}}$ and strings composed -of $n$ \pcode{a}s. The second pair of plots show the running time -for the regular expression $(a^*)^*\cdot b$ and also strings composed -of $n$ \pcode{a}s (meaning this regular expression actually does not +matcher (the plots on the right below)---more efficient than the +matchers from regular expression libraries in Ruby, Python and Java +(the plots on the left). The first pair of plots show the running time +for the regular expression $(a^*)^*\cdot b$ and strings composed of +$n$ \pcode{a}s (meaning this regular expression actually does not +match the strings). The second pair of plots show the running time for +the regular expressions $a^?{}^{\{n\}}\cdot a^{\{n\}}$ and strings +also composed of $n$ \pcode{a}s (this time the regular expressions match the strings). To see the substantial differences in the left -and right plots below, note the different scales of the $x$-axes. +and right plots below, note the different scales of the $x$-axes. + +\begin{center} +Graphs: $(a^*)^* \cdot b$ and strings $\underbrace{a\ldots a}_{n}$ +\begin{tabular}{@{}cc@{}} +\begin{tikzpicture} +\begin{axis}[ + xlabel={$n$}, + x label style={at={(1.05,0.0)}}, + ylabel={time in secs}, + enlargelimits=false, + xtick={0,5,...,30}, + xmax=33, + ymax=35, + ytick={0,5,...,30}, + scaled ticks=false, + axis lines=left, + width=5cm, + height=5cm, + legend entries={Java, Python}, + legend pos=north west, + legend cell align=left] +\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data}; +\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data}; +\end{axis} +\end{tikzpicture} +& +\begin{tikzpicture} + \begin{axis}[ + xlabel={$n$}, + x label style={at={(1.05,0.0)}}, + ylabel={time in secs}, + enlargelimits=false, + ymax=35, + ytick={0,5,...,30}, + axis lines=left, + scaled ticks=false, + width=6.5cm, + height=5cm, + legend entries={Scala V3}, + legend pos=north east, + legend cell align=left] +%\addplot[green,mark=square*,mark options={fill=white}] table {re2a.data}; +\addplot[black,mark=square*,mark options={fill=white}] table {re3a.data}; +\end{axis} +\end{tikzpicture} +\end{tabular} +\end{center} \begin{center} Graphs: $a^{?\{n\}} \cdot a^{\{n\}}$ and strings $\underbrace{a\ldots a}_{n}$\\ @@ -62,62 +110,26 @@ scaled ticks=false, axis lines=left, width=6.5cm, - height=5cm] -\addplot[green,mark=square*,mark options={fill=white}] table {re2.data}; + height=5cm, + legend entries={Scala V3}, + legend pos=north east, + legend cell align=left] +%\addplot[green,mark=square*,mark options={fill=white}] table {re2.data}; \addplot[black,mark=square*,mark options={fill=white}] table {re3.data}; \end{axis} \end{tikzpicture} \end{tabular} \end{center} - -\begin{center} -Graphs: $(a^*)^* \cdot b$ and strings $\underbrace{a\ldots a}_{n}$ -\begin{tabular}{@{}cc@{}} -\begin{tikzpicture} -\begin{axis}[ - xlabel={$n$}, - x label style={at={(1.05,0.0)}}, - ylabel={time in secs}, - enlargelimits=false, - xtick={0,5,...,30}, - xmax=33, - ymax=35, - ytick={0,5,...,30}, - scaled ticks=false, - axis lines=left, - width=5cm, - height=5cm, - legend entries={Java, Python}, - legend pos=north west, - legend cell align=left] -\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data}; -\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data}; -\end{axis} -\end{tikzpicture} -& -\begin{tikzpicture} - \begin{axis}[ - xlabel={$n$}, - x label style={at={(1.05,0.0)}}, - ylabel={time in secs}, - enlargelimits=false, - ymax=35, - ytick={0,5,...,30}, - axis lines=left, - scaled ticks=false, - width=6.5cm, - height=5cm] -\addplot[green,mark=square*,mark options={fill=white}] table {re2a.data}; -\addplot[black,mark=square*,mark options={fill=white}] table {re3a.data}; -\end{axis} -\end{tikzpicture} -\end{tabular} -\end{center}\medskip +\medskip \noindent -We will use these regular expressions and strings -as running examples. +We will use these regular expressions and strings as running +examples. There will be several versions (V1, V2, V3,\ldots) of the +algorithm.\footnote{The corresponding files are \texttt{re1.scala}, + \texttt{re2.scala} and so on. As usual, you can find the code on + KEATS.}\bigskip +\noindent Having specified in the previous lecture what problem our regular expression matcher is supposed to solve, namely for any given regular expression $r$ and string $s$ @@ -571,11 +583,11 @@ \code{case class NTIMES(r: Rexp, n: Int) extends Rexp} \end{center} -\noindent With this fix we have a constant ``size'' regular -expression for our running example no matter how large $n$ is. -This means we have to also add cases for \pcode{NTIMES} in the -functions $\textit{nullable}$ and $\textit{der}$. Does the change have any -effect? +\noindent With this fix we have a constant ``size'' regular expression +for our running example no matter how large $n$ is (see the +\texttt{size} section in the implementations). This means we have to +also add cases for \pcode{NTIMES} in the functions $\textit{nullable}$ +and $\textit{der}$. Does the change have any effect? \begin{center} \begin{tikzpicture} @@ -603,12 +615,12 @@ \end{tikzpicture} \end{center} -\noindent Now we are talking business! The modified matcher -can within 25 seconds handle regular expressions up to -$n = 1,100$ before a StackOverflow is raised. Recall that Python and Ruby -(and our first version, Scala V1) could only handle $n = 27$ or so in 30 -seconds. There is no change for our second example -$(a^*)^* \cdot b$---so this is still good. +\noindent Now we are talking business! The modified matcher can within +25 seconds handle regular expressions up to $n = 1,100$ before a +StackOverflow is raised. Recall that Python and Ruby (and our first +version, Scala V1) could only handle $n = 27$ or so in 30 +seconds. There is no change for our second example $(a^*)^* \cdot +b$---so this is still good. The moral is that our algorithm is rather sensitive to the @@ -618,7 +630,7 @@ however, one more issue for making the algorithm run faster. The derivative function often produces ``useless'' $\ZERO$s and $\ONE$s. To see this, consider $r = ((a -\cdot b) + b)^*$ and the following two derivatives +\cdot b) + b)^*$ and the following three derivatives \begin{center} \begin{tabular}{l} @@ -642,18 +654,18 @@ \end{center} \noindent I leave it to you to contemplate whether such a -simplification can have any impact on the correctness of our -algorithm (will it change any answers?). Figure~\ref{scala2} -gives a simplification function that recursively traverses a -regular expression and simplifies it according to the rules -given at the beginning. There are only rules for $+$, $\cdot$ -and $n$-times (the latter because we added it in the second -version of our matcher). There is no rule for a star, because -empirical data and also a little thought showed that -simplifying under a star is a waste of computation time. The -simplification function will be called after every derivation. -This additional step removes all the ``junk'' the derivative -function introduced. Does this improve the speed? You bet!! +simplification can have any impact on the correctness of our algorithm +(will it change any answers?). Figure~\ref{scala2} gives a +simplification function that recursively traverses a regular +expression and simplifies it according to the rules given at the +beginning. There are only rules for $+$, $\cdot$ and $n$-times (the +latter because we added it in the second version of our +matcher). There is no simplification rule for a star, because +empirical data and also a little thought showed that simplifying under +a star is a waste of computation time. The simplification function +will be called after every derivation. This additional step removes +all the ``junk'' the derivative function introduced. Does this improve +the speed? You bet!! \begin{figure}[p] \lstinputlisting[numbers=left,linebackgroundcolor= @@ -675,8 +687,8 @@ x label style={at={(1.04,0.0)}}, ylabel={time in secs}, enlargelimits=false, - xtick={0,3000,...,9000}, - xmax=10000, + xtick={0,2500,...,10000}, + xmax=12000, ytick={0,5,...,30}, ymax=32, scaled ticks=false, @@ -693,12 +705,13 @@ \end{center} \noindent -To reacap, Python and Ruby needed approximately 30 seconds to match -a string of 28 \texttt{a}s and the regular expression $a^{?\{n\}} \cdot a^{\{n\}}$. -We need a third of this time to do the same with strings up to 12,000 \texttt{a}s. -Similarly, Java needed 30 seconds to find out the regular expression -$(a^*)^* \cdot b$ does not match the string of 28 \texttt{a}s. We can do -the same in approximately 5 seconds for strings of 6000000 \texttt{a}s: +To reacap, Python and Ruby needed approximately 30 seconds to match a +string of 28 \texttt{a}s and the regular expression $a^{?\{n\}} \cdot +a^{\{n\}}$. We need a third of this time to do the same with strings +up to 11,000 \texttt{a}s. Similarly, Java and Python needed 30 +seconds to find out the regular expression $(a^*)^* \cdot b$ does not +match the string of 28 \texttt{a}s. We can do the same in +for strings of 6,000,000 \texttt{a}s: \begin{center} @@ -706,20 +719,20 @@ \begin{axis}[ title={Graph: $(a^*)^* \cdot b$ and strings $\underbrace{a\ldots a}_{n}$}, xlabel={$n$}, - x label style={at={(1.09,0.0)}}, ylabel={time in secs}, enlargelimits=false, - xmax=7700000, + ymax=35, ytick={0,5,...,30}, - ymax=32, - %scaled ticks=false, axis lines=left, + scaled ticks=false, + x label style={at={(1.09,0.0)}}, + %xmax=7700000, width=9cm, height=5cm, - legend entries={Scala V2, Scala V3}, + legend entries={Scala V3}, legend pos=outer north east, legend cell align=left] -\addplot[green,mark=square*,mark options={fill=white}] table {re2a.data}; +%\addplot[green,mark=square*,mark options={fill=white}] table {re2a.data}; \addplot[black,mark=square*,mark options={fill=white}] table {re3a.data}; \end{axis} \end{tikzpicture} @@ -728,7 +741,7 @@ \subsection*{Epilogue} (23/Aug/2016) I recently found another place where this algorithm can be -sped (this idea is not integrated with what is coming next, +sped up (this idea is not integrated with what is coming next, but I present it nonetheless). The idea is to define \texttt{ders} not such that it iterates the derivative character-by-character, but in bigger chunks. The resulting code for \texttt{ders2} looks as diff -r b78664a24f5d -r 48b842c997c7 handouts/ho03.pdf Binary file handouts/ho03.pdf has changed diff -r b78664a24f5d -r 48b842c997c7 progs/re2.scala --- a/progs/re2.scala Wed Mar 15 01:24:39 2017 +0000 +++ b/progs/re2.scala Wed Mar 15 14:34:10 2017 +0000 @@ -103,7 +103,7 @@ size(EVIL1(7)) // 7 -// but the size of the derivatives can grow +// but the size of the derivatives can still grow // quite dramatically size(ders("".toList, EVIL2)) // 5