handouts/ho02.tex
changeset 504 5dc452d7c08e
parent 492 39b7ff2cf1bc
child 510 25580bf89ac0
--- a/handouts/ho02.tex	Tue Sep 26 14:07:29 2017 +0100
+++ b/handouts/ho02.tex	Tue Sep 26 14:08:49 2017 +0100
@@ -12,16 +12,65 @@
 \section*{Handout 2 (Regular Expression Matching)}
 
 This lecture is about implementing a more efficient regular expression
-matcher (the plots on the right)---more efficient than the matchers
-from regular expression libraries in Ruby, Python and Java (the plots
-on the left). The first pair of plots show the running time for the
-regular expressions $a^?{}^{\{n\}}\cdot a^{\{n\}}$ and strings composed
-of $n$ \pcode{a}s. The second pair of plots show the running time
-for the regular expression $(a^*)^*\cdot b$ and also strings composed
-of $n$ \pcode{a}s (meaning this regular expression actually does not
+matcher (the plots on the right below)---more efficient than the
+matchers from regular expression libraries in Ruby, Python and Java
+(the plots on the left). The first pair of plots shows the running time
+for the regular expression $(a^*)^*\cdot b$ and strings composed of
+$n$ \pcode{a}s (meaning this regular expression actually does not
+match the strings). The second pair of plots shows the running time for
+the regular expressions $a^?{}^{\{n\}}\cdot a^{\{n\}}$ and strings
+also composed of $n$ \pcode{a}s (this time the regular expressions
 match the strings).  To see the substantial differences in the left
-and right plots below, note the different scales of the $x$-axes. 
+and right plots below, note the different scales of the $x$-axes.
+
 
+\begin{center}
+Graphs: $(a^*)^* \cdot b$ and strings $\underbrace{a\ldots a}_{n}$
+\begin{tabular}{@{}cc@{}}
+\begin{tikzpicture}
+\begin{axis}[
+    xlabel={$n$},
+    x label style={at={(1.05,0.0)}},
+    ylabel={time in secs},
+    enlargelimits=false,
+    xtick={0,5,...,30},
+    xmax=33,
+    ymax=35,
+    ytick={0,5,...,30},
+    scaled ticks=false,
+    axis lines=left,
+    width=5cm,
+    height=5cm, 
+    legend entries={Java, Python},  
+    legend pos=north west,
+    legend cell align=left]
+\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
+\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
+\end{axis}
+\end{tikzpicture}
+&
+\begin{tikzpicture}
+  \begin{axis}[
+    xlabel={$n$},
+    x label style={at={(1.1,0.0)}},
+    %%xtick={0,1000000,...,5000000}, 
+    ylabel={time in secs},
+    enlargelimits=false,
+    ymax=35,
+    ytick={0,5,...,30},
+    axis lines=left,
+    %scaled ticks=false,
+    width=6.5cm,
+    height=5cm,
+    legend entries={Our matcher},  
+    legend pos=north east,
+    legend cell align=left]
+%\addplot[green,mark=square*,mark options={fill=white}] table {re2a.data};    
+\addplot[black,mark=square*,mark options={fill=white}] table {re3a.data};
+\end{axis}
+\end{tikzpicture}
+\end{tabular}
+\end{center}\bigskip
 
 \begin{center}
 Graphs: $a^{?\{n\}} \cdot a^{\{n\}}$ and strings $\underbrace{a\ldots a}_{n}$\\
@@ -54,68 +103,33 @@
     x label style={at={(1.1,0.05)}},
     ylabel={\small time in secs},
     enlargelimits=false,
-    xtick={0,3000,...,9000},
-    xmax=10000,
+    xtick={0,2500,...,11000},
+    xmax=12000,
     ymax=35,
     ytick={0,5,...,30},
     scaled ticks=false,
     axis lines=left,
     width=6.5cm,
-    height=5cm]
-\addplot[green,mark=square*,mark options={fill=white}] table {re2.data};
+    height=5cm,
+    legend entries={Our matcher},  
+    legend pos=north east,
+    legend cell align=left]
+%\addplot[green,mark=square*,mark options={fill=white}] table {re2.data};
 \addplot[black,mark=square*,mark options={fill=white}] table {re3.data};
 \end{axis}
 \end{tikzpicture}
 \end{tabular}
 \end{center}
-
-\begin{center}
-Graphs: $(a^*)^* \cdot b$ and strings $\underbrace{a\ldots a}_{n}$
-\begin{tabular}{@{}cc@{}}
-\begin{tikzpicture}
-\begin{axis}[
-    xlabel={$n$},
-    x label style={at={(1.05,0.0)}},
-    ylabel={time in secs},
-    enlargelimits=false,
-    xtick={0,5,...,30},
-    xmax=33,
-    ymax=35,
-    ytick={0,5,...,30},
-    scaled ticks=false,
-    axis lines=left,
-    width=5cm,
-    height=5cm, 
-    legend entries={Java},  
-    legend pos=north west,
-    legend cell align=left]
-\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
-\end{axis}
-\end{tikzpicture}
-&
-\begin{tikzpicture}
-  \begin{axis}[
-    xlabel={$n$},
-    x label style={at={(1.05,0.0)}},
-    ylabel={time in secs},
-    enlargelimits=false,
-    ymax=35,
-    ytick={0,5,...,30},
-    axis lines=left,
-    scaled ticks=false,
-    width=6.5cm,
-    height=5cm]
-\addplot[green,mark=square*,mark options={fill=white}] table {re2a.data};    
-\addplot[black,mark=square*,mark options={fill=white}] table {re3a.data};
-\end{axis}
-\end{tikzpicture}
-\end{tabular}
-\end{center}\medskip
+\bigskip
 
 \noindent
-We will use these regular expressions and strings
-as running examples.
+In what follows we will use these regular expressions and strings as
+running examples. There will be several versions (V1, V2, V3,\ldots)
+of our matcher.\footnote{The corresponding files are
+  \texttt{re1.scala}, \texttt{re2.scala} and so on. As usual, you can
+  find the code on KEATS.}\bigskip
 
+\noindent
 Having specified in the previous lecture what
 problem our regular expression matcher is supposed to solve,
 namely for any given regular expression $r$ and string $s$
@@ -125,7 +139,7 @@
 s \in L(r)
 \]
 
-\noindent we can look at an algorithm to solve this problem. Clearly
+\noindent we can look for an algorithm to solve this problem. Clearly
 we cannot use the function $L$ directly for this, because in general
 the set of strings $L$ returns is infinite (recall what $L(a^*)$ is).
 In such cases there is no way we can implement an exhaustive test for
@@ -224,13 +238,14 @@
 \end{equation}
 
 \noindent If we can find an equivalent regular expression that is
-simpler (smaller for example), then this might potentially make our
-matching algorithm run faster. We can look for such a simpler regular
-expression $r'$ because whether a string $s$ is in $L(r)$ or in
-$L(r')$ with $r\equiv r'$ will always give the same answer. In the
-example above you will see that the regular expression is equivalent
-to just $r_1$. You can verify this by iteratively applying the
-simplification rules from above:
+simpler (that usually means smaller), then this might potentially make
+our matching algorithm run faster. We can look for such a simpler
+regular expression $r'$ because whether a string $s$ is in $L(r)$ or
+in $L(r')$ with $r\equiv r'$ will always give the same answer. Yes?
+
+In the example above you will see that the regular expression is
+equivalent to just $r_1$. You can verify this by iteratively applying
+the simplification rules from above:
 
 \begin{center}
 \begin{tabular}{ll}
@@ -251,6 +266,24 @@
 $\ZERO$s, therefore simplifying them away will make the
 algorithm quite a bit faster.
 
+Finally here are three equivalences between regular expressions which are
+not so obvious:
+
+\begin{center}
+\begin{tabular}{rcl}
+$r^*$  & $\equiv$ & $1 + r\cdot r^*$\\
+$(r_1 + r_2)^*$  & $\equiv$ & $r_1^* \cdot (r_2\cdot r_1^*)^*$\\
+$(r_1 \cdot r_2)^*$ & $\equiv$ & $1 + r_1\cdot (r_2 \cdot r_1)^* \cdot r_2$\\
+\end{tabular}
+\end{center}
+
+\noindent
+We will not use them in our algorithm, but feel free to convince you
+that they hold. As an aside, there has been a lot of research about
+questions like: Can one always decide when two regular expressions are
+equivalent or not? What does an algorithm look like to decide this
+efficiently?
+
 \subsection*{The Matching Algorithm}
 
 The algorithm we will define below consists of two parts. One
@@ -286,10 +319,10 @@
 \emph{derivative} of a regular expression. This is a function
 which will take a regular expression, say $r$, and a
 character, say $c$, as arguments and returns a new regular
-expression. Be careful that the intuition behind this function
+expression. Be mindful that the intuition behind this function
 is not so easy to grasp on first reading. Essentially this
 function solves the following problem: if $r$ can match a
-string of the form $c\!::\!s$, what does the regular
+string of the form $c\!::\!s$, what does a regular
 expression look like that can match just $s$? The definition
 of this function is as follows:
 
@@ -343,9 +376,9 @@
 and ``append'' $r^*$ in order to match the rest of $s$. Still
 makes sense?
 
-If all this did not make sense yet, here is another way to rationalise
-the definition of $\textit{der}$ by considering the following operation
-on sets:
+If all this did not make sense yet, here is another way to explain the
+definition of $\textit{der}$ by considering the following operation on
+sets:
 
 \begin{equation}\label{Der}
 \textit{Der}\,c\,A\;\dn\;\{s\,|\,c\!::\!s \in A\}
@@ -425,7 +458,7 @@
 \end{center}
 
 \noindent This function iterates $\textit{der}$ taking one character at
-the time from the original string until it is exhausted.
+the time from the original string until the string is exhausted.
 Having $\textit{der}s$ in place, we can finally define our matching
 algorithm:
 
@@ -461,12 +494,16 @@
 for \pcode{matches} are shown in Figure~\ref{scala1}.
 
 \begin{figure}[p]
-\lstinputlisting{../progs/app5.scala}
-\caption{Scala implementation of the \textit{nullable} and 
+  \lstinputlisting[numbers=left,linebackgroundcolor=
+                  {\ifodd\value{lstnumber}\color{capri!3}\fi}]
+                  {../progs/app5.scala}
+\caption{A Scala implementation of the \textit{nullable} and 
   derivative functions. These functions are easy to
-  implement in functional languages, because their built-in pattern 
+  implement in functional languages. This is because pattern 
   matching and recursion allow us to mimic the mathematical
-  definitions very closely.\label{scala1}}
+  definitions very closely. Nearly all functional
+  programming languages support pattern matching and
+  recursion out of the box.\label{scala1}}
 \end{figure}
 
 
@@ -507,8 +544,8 @@
 
 For running the algorithm with our first example, the evil
 regular expression $a^?{}^{\{n\}}a^{\{n\}}$, we need to implement
-the optional regular expression and the exactly $n$-times
-regular expression. This can be done with the translations
+the optional regular expression and the `exactly $n$-times
+regular expression'. This can be done with the translations
 
 \lstinputlisting[numbers=none]{../progs/app51.scala}
 
@@ -541,8 +578,8 @@
 \end{tikzpicture}
 \end{center}
 
-\noindent Analysing this failure we notice that for
-$a^{\{n\}}$ we generate quite big regular expressions:
+\noindent Analysing this failure we notice that for $a^{\{n\}}$, for
+example, we generate quite big regular expressions:
 
 \begin{center}
 \begin{tabular}{rl}
@@ -560,18 +597,18 @@
 large regular expressions will cause problems. This problem
 is aggravated by $a^?$ being represented as $a + \ONE$.
 
-We can however fix this by having an explicit constructor for
+We can however fix this easily by having an explicit constructor for
 $r^{\{n\}}$. In Scala we would introduce a constructor like
 
 \begin{center}
 \code{case class NTIMES(r: Rexp, n: Int) extends Rexp}
 \end{center}
 
-\noindent With this fix we have a constant ``size'' regular
-expression for our running example no matter how large $n$ is.
-This means we have to also add cases for \pcode{NTIMES} in the
-functions $\textit{nullable}$ and $\textit{der}$. Does the change have any
-effect?
+\noindent With this fix we have a constant ``size'' regular expression
+for our running example no matter how large $n$ is (see the
+\texttt{size} section in the implementations).  This means we have to
+also add cases for \pcode{NTIMES} in the functions $\textit{nullable}$
+and $\textit{der}$. Does the change have any effect?
 
 \begin{center}
 \begin{tikzpicture}
@@ -581,8 +618,8 @@
     x label style={at={(1.01,0.0)}},
     ylabel={time in secs},
     enlargelimits=false,
-    xtick={0,100,...,1000},
-    xmax=1100,
+    xtick={0,200,...,1100},
+    xmax=1200,
     ytick={0,5,...,30},
     scaled ticks=false,
     axis lines=left,
@@ -599,12 +636,12 @@
 \end{tikzpicture}
 \end{center}
 
-\noindent Now we are talking business! The modified matcher 
-can within 30 seconds handle regular expressions up to
-$n = 950$ before a StackOverflow is raised. Recall that Python and Ruby 
-(and our first version, Scala V1) could only handle $n = 27$ or so in 30 
-seconds. There is no change for our second example 
-$(a^*)^* \cdot b$---so this is still good.
+\noindent Now we are talking business! The modified matcher can within
+25 seconds handle regular expressions up to $n = 1,100$ before a
+StackOverflow is raised. Recall that Python and Ruby (and our first
+version, Scala V1) could only handle $n = 27$ or so in 30
+seconds. We have not tried our algorithm on the second example $(a^*)^* \cdot
+b$---but it is doing OK with it.
 
 
 The moral is that our algorithm is rather sensitive to the
@@ -614,7 +651,7 @@
 however, one more issue for making the algorithm run faster.
 The derivative function often produces ``useless''
 $\ZERO$s and $\ONE$s. To see this, consider $r = ((a
-\cdot b) + b)^*$ and the following two derivatives
+\cdot b) + b)^*$ and the following three derivatives
 
 \begin{center}
 \begin{tabular}{l}
@@ -625,9 +662,9 @@
 \end{center}
 
 \noindent 
-If we simplify them according to the simple rules from the
-beginning, we can replace the right-hand sides by the 
-smaller equivalent regular expressions
+If we simplify them according to the simplification rules from the
+beginning, we can replace the right-hand sides by the smaller
+equivalent regular expressions
 
 \begin{center}
 \begin{tabular}{l}
@@ -638,21 +675,23 @@
 \end{center}
 
 \noindent I leave it to you to contemplate whether such a
-simplification can have any impact on the correctness of our
-algorithm (will it change any answers?). Figure~\ref{scala2}
-gives a simplification function that recursively traverses a
-regular expression and simplifies it according to the rules
-given at the beginning. There are only rules for $+$, $\cdot$
-and $n$-times (the latter because we added it in the second
-version of our matcher). There is no rule for a star, because
-empirical data and also a little thought showed that
-simplifying under a star is a waste of computation time. The
-simplification function will be called after every derivation.
-This additional step removes all the ``junk'' the derivative
-function introduced. Does this improve the speed? You bet!!
+simplification can have any impact on the correctness of our algorithm
+(will it change any answers?). Figure~\ref{scala2} gives a
+simplification function that recursively traverses a regular
+expression and simplifies it according to the rules given at the
+beginning. There are only rules for $+$, $\cdot$ and $n$-times (the
+latter because we added it in the second version of our
+matcher). There is no simplification rule for a star, because
+empirical data and also a little thought showed that simplifying under
+a star is a waste of computation time. The simplification function
+will be called after every derivation.  This additional step removes
+all the ``junk'' the derivative function introduced. Does this improve
+the speed? You bet!!
 
 \begin{figure}[p]
-\lstinputlisting{../progs/app6.scala}
+\lstinputlisting[numbers=left,linebackgroundcolor=
+  {\ifodd\value{lstnumber}\color{capri!3}\fi}]
+                {../progs/app6.scala}
 \caption{The simplification function and modified 
 \texttt{ders}-function; this function now
 calls \texttt{der} first, but then simplifies
@@ -669,8 +708,8 @@
     x label style={at={(1.04,0.0)}},
     ylabel={time in secs},
     enlargelimits=false,
-    xtick={0,3000,...,9000},
-    xmax=10000,
+    xtick={0,2500,...,10000},
+    xmax=12000,
     ytick={0,5,...,30},
     ymax=32,
     scaled ticks=false,
@@ -687,12 +726,13 @@
 \end{center}
 
 \noindent
-To reacap, Python and Ruby needed approximately 30 seconds to match
-a string of 28 \texttt{a}s and the regular expression $a^{?\{n\}} \cdot a^{\{n\}}$. 
-We need a third of this time to do the same with strings up to 12,000 \texttt{a}s. 
-Similarly, Java needed 30 seconds to find out the regular expression 
-$(a^*)^* \cdot b$ does not match the string of 28 \texttt{a}s. We can do
-the same in approximately 5 seconds for strings of 6000000 \texttt{a}s:
+To reacap, Python and Ruby needed approximately 30 seconds to match a
+string of 28 \texttt{a}s and the regular expression $a^{?\{n\}} \cdot
+a^{\{n\}}$.  We need a third of this time to do the same with strings
+up to 11,000 \texttt{a}s.  Similarly, Java and Python needed 30
+seconds to find out the regular expression $(a^*)^* \cdot b$ does not
+match the string of 28 \texttt{a}s. We can do the same in
+for strings composed of nearly 6,000,000 \texttt{a}s:
 
 
 \begin{center}
@@ -700,20 +740,20 @@
 \begin{axis}[
     title={Graph: $(a^*)^* \cdot b$ and strings $\underbrace{a\ldots a}_{n}$},
     xlabel={$n$},
-    x label style={at={(1.09,0.0)}},
     ylabel={time in secs},
     enlargelimits=false,
-    xmax=7700000,
+    ymax=35,
     ytick={0,5,...,30},
-    ymax=32,
+    axis lines=left,
     %scaled ticks=false,
-    axis lines=left,
+    x label style={at={(1.09,0.0)}},
+    %xmax=7700000,
     width=9cm,
     height=5cm, 
-    legend entries={Scala V2, Scala V3},
+    legend entries={Scala V3},
     legend pos=outer north east,
     legend cell align=left]
-\addplot[green,mark=square*,mark options={fill=white}] table {re2a.data};
+%\addplot[green,mark=square*,mark options={fill=white}] table {re2a.data};
 \addplot[black,mark=square*,mark options={fill=white}] table {re3a.data};
 \end{axis}
 \end{tikzpicture}
@@ -721,12 +761,11 @@
 
 \subsection*{Epilogue}
 
-(23/Aug/2016) I recently found another place where this algorithm can be 
-sped (this idea is not integrated with what is coming next,
-but I present it nonetheless). The idea is to define \texttt{ders}
-not such that it iterates the derivative character-by-character, but
-in bigger chunks. The resulting code for \texttt{ders2} looks as
-follows:
+(23/Aug/2016) I recently found another place where this algorithm can
+be sped up (this idea is not integrated with what is coming next, but
+I present it nonetheless). The idea is to not define \texttt{ders}
+that it iterates the derivative character-by-character, but in bigger
+chunks. The resulting code for \texttt{ders2} looks as follows:
 
 \lstinputlisting[numbers=none]{../progs/app52.scala} 
 
@@ -756,7 +795,7 @@
     ymax=33,
     %scaled ticks=false,
     axis lines=left,
-    width=5.5cm,
+    width=5.3cm,
     height=5cm, 
     legend entries={Scala V3, Scala V4},
     legend style={at={(0.1,-0.2)},anchor=north}]
@@ -772,12 +811,12 @@
     x label style={at={(1.09,0.0)}},
     ylabel={time in secs},
     enlargelimits=false,
-    xmax=8100000,
+    xmax=8200000,
     ytick={0,5,...,30},
     ymax=33,
     %scaled ticks=false,
     axis lines=left,
-    width=5.5cm,
+    width=5.3cm,
     height=5cm, 
     legend entries={Scala V3, Scala V4},
     legend style={at={(0.1,-0.2)},anchor=north}]
@@ -793,7 +832,7 @@
 
 You might not like doing proofs. But they serve a very
 important purpose in Computer Science: How can we be sure that
-our algorithm matches its specification. We can try to test
+our algorithm matches its specification? We can try to test
 the algorithm, but that often overlooks corner cases and an
 exhaustive testing is impossible (since there are infinitely
 many inputs). Proofs allow us to ensure that an algorithm
@@ -814,7 +853,7 @@
   \end{tabular}
 \end{center}
 
-\noindent If you want to show a property $P(r)$ for all 
+\noindent If you want to show a property $P(r)$ for \emph{all} 
 regular expressions $r$, then you have to follow essentially 
 the recipe:
 
@@ -866,8 +905,9 @@
 \label{propalt}
 \end{equation}
 
-\noindent The difference to the base cases is that in this
-case we can already assume we proved
+\noindent The difference to the base cases is that in the inductive
+cases we can already assume we proved $P$ for the components, that is
+we can assume.
 
 \begin{center}
 \begin{tabular}{l}
@@ -876,9 +916,9 @@
 \end{tabular}
 \end{center}
 
-\noindent These are the induction hypotheses. To check this 
+\noindent These are called the induction hypotheses. To check this 
 case, we can start from $\textit{nullable}(r_1 + r_2)$, which by 
-definition is
+definition of $\textit{nullable}$ is
 
 \[
 \textit{nullable}(r_1) \vee \textit{nullable}(r_2)
@@ -901,18 +941,18 @@
 [] \in L(r_1)\cup L(r_2)
 \]
 
-\noindent but this is by definition of $L$ exactly $[] \in
-L(r_1 + r_2)$, which we needed to establish according to
+\noindent but this is by definition of $L$ exactly $[] \in L(r_1 +
+r_2)$, which we needed to establish according to statement in
 \eqref{propalt}. What we have shown is that starting from
 $\textit{nullable}(r_1 + r_2)$ we have done equivalent transformations
-to end up with $[] \in L(r_1 + r_2)$. Consequently we have
-established that $P(r_1 + r_2)$ holds.
+to end up with $[] \in L(r_1 + r_2)$. Consequently we have established
+that $P(r_1 + r_2)$ holds.
 
 In order to complete the proof we would now need to look 
 at the cases \mbox{$P(r_1\cdot r_2)$} and $P(r^*)$. Again I let you
 check the details.
 
-You might have to do induction proofs over strings. 
+You might also have to do induction proofs over strings. 
 That means you want to establish a property $P(s)$ for all
 strings $s$. For this remember strings are lists of 
 characters. These lists can be either the empty list or a
@@ -948,8 +988,9 @@
 L(\textit{der}\,c\,r) = \textit{Der}\,c\,(L(r))
 \]
 
-\noindent holds (this would be of course a property that
-needs to be proved in a side-lemma by induction on $r$).
+\noindent holds (this would be of course another property that needs
+to be proved in a side-lemma by induction on $r$). This is a bit
+more challenging, but not impossible.
 
 To sum up, using reasoning like the one shown above allows us 
 to show the correctness of our algorithm. To see this,
@@ -968,9 +1009,9 @@
 \label{dersstep}
 \end{equation}
 
-\noindent But we have shown above in \eqref{dersprop}, that
-the $\textit{Ders}$ can be replaced by $L(\textit{ders}\ldots)$. That means 
-\eqref{dersstep} is equivalent to 
+\noindent You agree?  But we have shown above in \eqref{dersprop},
+that the $\textit{Ders}$ can be replaced by
+$L(\textit{ders}\ldots)$. That means \eqref{dersstep} is equivalent to
 
 \begin{equation}
 [] \in L(\textit{ders}\,s\,r)
@@ -999,11 +1040,13 @@
 s\in L(r)
 \]
 
-\noindent which is the property we set out to prove:
-our algorithm meets its specification. To have done
-so, requires a few induction proofs about strings and
-regular expressions. Following the recipes is already a big 
-step in performing these proofs.
+\noindent which is the property we set out to prove: our algorithm
+meets its specification. To have done so, requires a few induction
+proofs about strings and regular expressions. Following the \emph{induction
+recipes} is already a big step in actually performing these proofs.
+If you do not believe it, proofs have helped me to make sure my code
+is correct and in several instances prevented me of letting slip
+embarassing mistakes into the `wild'.
 
 \end{document}