afl-material: comparison handouts/ho02.tex


-:a697421eaa04
+:598741d39d21
 \end{tikzpicture}
 &
 \begin{tikzpicture}
 \begin{axis}[
 xlabel={$n$},
-x label style={at={(1.05,0.0)}},
+x label style={at={(1.1,0.0)}},
+%%xtick={0,1000000,...,5000000},
 ylabel={time in secs},
 enlargelimits=false,
 ymax=35,
 ytick={0,5,...,30},
 axis lines=left,
-scaled ticks=false,
+%scaled ticks=false,
 width=6.5cm,
 height=5cm,
-legend entries={Scala V3},
+legend entries={Our matcher},
 legend pos=north east,
 legend cell align=left]
 %\addplot[green,mark=square*,mark options={fill=white}] table {re2a.data};
 \addplot[black,mark=square*,mark options={fill=white}] table {re3a.data};
 \end{axis}
 \end{tikzpicture}
 \end{tabular}
-\end{center}
+\end{center}\bigskip
 \begin{center}
 Graphs: $a^{?\{n\}} \cdot a^{\{n\}}$ and strings $\underbrace{a\ldots a}_{n}$\\
 \begin{tabular}{@{}cc@{}}
 \begin{tikzpicture}
 ytick={0,5,...,30},
 scaled ticks=false,
 axis lines=left,
 width=6.5cm,
 height=5cm,
-legend entries={Scala V3},
+legend entries={Our matcher},
 legend pos=north east,
 legend cell align=left]
 %\addplot[green,mark=square*,mark options={fill=white}] table {re2.data};
 \addplot[black,mark=square*,mark options={fill=white}] table {re3.data};
 \end{axis}
 \end{tikzpicture}
 \end{tabular}
 \end{center}
-\medskip
+\bigskip
 \noindent
-We will use these regular expressions and strings as running
+In what follows we will use these regular expressions and strings as
-examples. There will be several versions (V1, V2, V3,\ldots) of the
+running examples. There will be several versions (V1, V2, V3,\ldots)
-algorithm.\footnote{The corresponding files are \texttt{re1.scala},
+of our matcher.\footnote{The corresponding files are
-\texttt{re2.scala} and so on. As usual, you can find the code on
+\texttt{re1.scala}, \texttt{re2.scala} and so on. As usual, you can
-KEATS.}\bigskip
+find the code on KEATS.}\bigskip
 \noindent
 Having specified in the previous lecture what
 problem our regular expression matcher is supposed to solve,
 namely for any given regular expression $r$ and string $s$
 \[
 s \in L(r)
 \]
-\noindent we can look at an algorithm to solve this problem. Clearly
+\noindent we can look for an algorithm to solve this problem. Clearly
 we cannot use the function $L$ directly for this, because in general
 the set of strings $L$ returns is infinite (recall what $L(a^*)$ is).
 In such cases there is no way we can implement an exhaustive test for
 whether a string is a member of this set or not. In contrast, our
 matching algorithm will operate on the regular expression $r$ and
 (r_1 + \ZERO) \cdot \ONE + ((\ONE + r_2) + r_3) \cdot (r_4 \cdot \ZERO)
 \label{big}
 \end{equation}
 \noindent If we can find an equivalent regular expression that is
-simpler (smaller for example), then this might potentially make our
+simpler (that usually means smaller), then this might potentially make
-matching algorithm run faster. We can look for such a simpler regular
+our matching algorithm run faster. We can look for such a simpler
-expression $r'$ because whether a string $s$ is in $L(r)$ or in
+regular expression $r'$ because whether a string $s$ is in $L(r)$ or
-$L(r')$ with $r\equiv r'$ will always give the same answer. In the
+in $L(r')$ with $r\equiv r'$ will always give the same answer. Yes?
-example above you will see that the regular expression is equivalent
-to just $r_1$. You can verify this by iteratively applying the
+In the example above you will see that the regular expression is
-simplification rules from above:
+equivalent to just $r_1$. You can verify this by iteratively applying
+the simplification rules from above:
 \begin{center}
 \begin{tabular}{ll}
 & $(r_1 + \ZERO) \cdot \ONE + ((\ONE + r_2) + r_3) \cdot
 (\underline{r_4 \cdot \ZERO})$\smallskip\\
 rule is applied. Our matching algorithm in the next section
 will often generate such ``useless'' $\ONE$s and
 $\ZERO$s, therefore simplifying them away will make the
 algorithm quite a bit faster.
-Finally here are three equivalences between regulare expressions which are
+Finally here are three equivalences between regular expressions which are
 not so obvious:
 \begin{center}
 \begin{tabular}{rcl}
 $r^*$  & $\equiv$ & $\ONE + r\cdot r^*$\\
 $(r_1 \cdot r_2)^*$ & $\equiv$ & $\ONE + r_1\cdot (r_2 \cdot r_1)^* \cdot r_2$\\
 \end{tabular}
 \end{center}
 \noindent
-You can try to establish them. As an aside, there has been a lot of research
+We will not use them in our algorithm, but feel free to make sure they
-in questions like: Can one always decide when two regular expressions are
+hold. As an aside, there has been a lot of research about questions
-equivalent or not? What does an algorithm look like to decide this?
+like: Can one always decide when two regular expressions are
+equivalent or not? What does an algorithm look like to decide this
+efficiently?
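 
 \noindent To get a feeling for why the first of the equivalences above,
 $r^* \equiv \ONE + r\cdot r^*$, holds, here is a sketch of the
 calculation. It assumes that the meaning of the star is given by
 $L(r^*) = \bigcup_{n\ge 0} L(r)^n$ with $L(r)^0 = \{[]\}$ and
 $L(r)^{n+1} = L(r)\,@\,L(r)^n$, where $@$ stands for the concatenation
 of sets of strings (please check this against the definitions from the
 previous lecture):
 \[
 \begin{array}{lcl}
 L(r^*) & = & \bigcup_{n\ge 0} L(r)^n\\
        & = & L(r)^0 \;\cup\; \bigcup_{n\ge 0} L(r)^{n+1}\\
        & = & \{[]\} \;\cup\; \bigl(L(r)\,@\,\bigcup_{n\ge 0} L(r)^n\bigr)\\
        & = & L(\ONE) \;\cup\; \bigl(L(r)\,@\,L(r^*)\bigr)\\
        & = & L(\ONE + r\cdot r^*)
 \end{array}
 \]
 \noindent The third step uses the fact that concatenation distributes
 over unions; the last step is just the definition of $L$ read backwards.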
 \subsection*{The Matching Algorithm}
 The algorithm we will define below consists of two parts. One
 is the function $\textit{nullable}$ which takes a regular expression as
 The other function of our matching algorithm calculates a
 \emph{derivative} of a regular expression. This is a function
 which will take a regular expression, say $r$, and a
 character, say $c$, as arguments and returns a new regular
-expression. Be careful that the intuition behind this function
+expression. Be mindful that the intuition behind this function
 is not so easy to grasp on first reading. Essentially this
 function solves the following problem: if $r$ can match a
-string of the form $c\!::\!s$, what does the regular
+string of the form $c\!::\!s$, what does a regular
 expression look like that can match just $s$? The definition
 of this function is as follows:
 \begin{center}
 \begin{tabular}{l@ {\hspace{2mm}}c@ {\hspace{2mm}}l}
 $c\!::\!s$, then the first part must be ``matched'' by a
 single copy of $r$. Therefore we call recursively $\textit{der}\,c\,r$
 and ``append'' $r^*$ in order to match the rest of $s$. Still
 makes sense?
-If all this did not make sense yet, here is another way to rationalise
+If all this did not make sense yet, here is another way to explain the
-the definition of $\textit{der}$ by considering the following operation
+definition of $\textit{der}$ by considering the following operation on
-on sets:
+sets:
 \begin{equation}\label{Der}
 \textit{Der}\,c\,A\;\dn\;\{s\,|\,c\!::\!s \in A\}
 \end{equation}
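 \noindent As a small worked instance of \eqref{Der}, take a set of
 strings chosen purely for illustration (it is not part of the
 definition), say $A = \{\textit{foo}, \textit{bar}, \textit{frak}\}$.
 Keeping only the strings that start with the given character and
 chopping that character off gives
 \begin{center}
 \begin{tabular}{l@{\hspace{2mm}}c@{\hspace{2mm}}l}
 $\textit{Der}\,f\,A$ & $=$ & $\{\textit{oo}, \textit{rak}\}$\\
 $\textit{Der}\,b\,A$ & $=$ & $\{\textit{ar}\}$\\
 $\textit{Der}\,a\,A$ & $=$ & $\{\}$\\
 \end{tabular}
 \end{center}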
 $\textit{ders}\, (c\!::\!s)\, r$ & $\dn$ & $\textit{ders}\,s\,(\textit{der}\,c\,r)$ & \\
 \end{tabular}
 \end{center}
 \noindent This function iterates $\textit{der}$ taking one character at
-the time from the original string until it is exhausted.
+a time from the original string until the string is exhausted.
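 
 \noindent To see the iteration at work, here is a sketch of how
 $\textit{ders}$ unfolds for the two-character string $ab$. It assumes
 that the clause for the empty string simply returns its argument, that
 is $\textit{ders}\,[]\,r \dn r$:
 \begin{center}
 \begin{tabular}{l@{\hspace{2mm}}c@{\hspace{2mm}}l}
 $\textit{ders}\,(a\!::\!b\!::\![])\,r$
   & $=$ & $\textit{ders}\,(b\!::\![])\,(\textit{der}\,a\,r)$\\
   & $=$ & $\textit{ders}\,[]\,(\textit{der}\,b\,(\textit{der}\,a\,r))$\\
   & $=$ & $\textit{der}\,b\,(\textit{der}\,a\,r)$\\
 \end{tabular}
 \end{center}
 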
 Having $\textit{ders}$ in place, we can finally define our matching
 algorithm:
 \[
 \textit{matches}\,s\,r \dn \textit{nullable}(\textit{ders}\,s\,r)
 \begin{figure}[p]
 \lstinputlisting[numbers=left,linebackgroundcolor=
 {\ifodd\value{lstnumber}\color{capri!3}\fi}]
 {../progs/app5.scala}
-\caption{Scala implementation of the \textit{nullable} and
+\caption{A Scala implementation of the \textit{nullable} and
 derivative functions. These functions are easy to
-implement in functional languages, because their built-in pattern
+implement in functional languages. This is because pattern
 matching and recursion allow us to mimic the mathematical
-definitions very closely.\label{scala1}}
+definitions very closely. Nearly all functional
+programming languages support pattern matching and
+recursion out of the box.\label{scala1}}
 \end{figure}
 %Remember our second example involving the regular expression
 %$(a^*)^* \cdot b$ which could not match strings of $n$ \texttt{a}s.
 %strings up to the length of 6500. After that we receive a
 %StackOverflow exception, but still\ldots
 For running the algorithm with our first example, the evil
 regular expression $a^{?\{n\}} \cdot a^{\{n\}}$, we need to implement
-the optional regular expression and the exactly $n$-times
+the optional regular expression and the `exactly $n$-times
-regular expression. This can be done with the translations
+regular expression'. This can be done with the translations
 \lstinputlisting[numbers=none]{../progs/app51.scala}
 \noindent Running the matcher with this example, we find it is
 slightly worse than the matcher in Ruby and Python.
 \addplot[red,mark=triangle*,mark options={fill=white}] table {re1.data};
 \end{axis}
 \end{tikzpicture}
 \end{center}
-\noindent Analysing this failure we notice that for
+\noindent Analysing this failure we notice that for $a^{\{n\}}$, for
-$a^{\{n\}}$ we generate quite big regular expressions:
+example, we generate quite big regular expressions:
 \begin{center}
 \begin{tabular}{rl}
 1: & $a$\\
 2: & $a\cdot a$\\
 \noindent Our algorithm traverses such regular expressions at
 least once every time a derivative is calculated. So having
 large regular expressions will cause problems. This problem
 is aggravated by $a^?$ being represented as $a + \ONE$.
-We can however fix this by having an explicit constructor for
+We can however fix this easily by having an explicit constructor for
 $r^{\{n\}}$. In Scala we would introduce a constructor like
 \begin{center}
 \code{case class NTIMES(r: Rexp, n: Int) extends Rexp}
 \end{center}
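 \noindent To make this concrete, here is a minimal sketch of how such a
 constructor could be wired into the matcher. Only \texttt{Rexp} and
 \texttt{NTIMES} are taken from the text above; the other constructor
 names (\texttt{ZERO}, \texttt{ONE}, \texttt{CHAR}, \texttt{ALT},
 \texttt{SEQ}, \texttt{STAR}) and the exact shape of the functions are
 assumptions made for illustration, so the code on KEATS may well look
 different. The sketch uses top-level definitions and is therefore meant
 for Scala 3 or the REPL.
 \begin{lstlisting}[numbers=none]
 // Regular expressions; all names except Rexp and NTIMES are assumed
 abstract class Rexp
 case object ZERO extends Rexp
 case object ONE extends Rexp
 case class CHAR(c: Char) extends Rexp
 case class ALT(r1: Rexp, r2: Rexp) extends Rexp
 case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
 case class STAR(r: Rexp) extends Rexp
 case class NTIMES(r: Rexp, n: Int) extends Rexp

 // Can the regular expression match the empty string?
 def nullable(r: Rexp): Boolean = r match {
   case ZERO => false
   case ONE => true
   case CHAR(_) => false
   case ALT(r1, r2) => nullable(r1) || nullable(r2)
   case SEQ(r1, r2) => nullable(r1) && nullable(r2)
   case STAR(_) => true
   // zero copies match only the empty string
   case NTIMES(r1, n) => if (n == 0) true else nullable(r1)
 }

 // Derivative of a regular expression w.r.t. a character
 def der(c: Char, r: Rexp): Rexp = r match {
   case ZERO => ZERO
   case ONE => ZERO
   case CHAR(d) => if (c == d) ONE else ZERO
   case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
   case SEQ(r1, r2) =>
     if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
     else SEQ(der(c, r1), r2)
   case STAR(r1) => SEQ(der(c, r1), STAR(r1))
   // peel off one copy of r1, leaving n - 1 copies to be matched
   case NTIMES(r1, n) =>
     if (n == 0) ZERO else SEQ(der(c, r1), NTIMES(r1, n - 1))
 }

 // Iterate der over the string, one character at a time
 def ders(s: List[Char], r: Rexp): Rexp = s match {
   case Nil => r
   case c :: cs => ders(cs, der(c, r))
 }

 def matches(s: String, r: Rexp): Boolean =
   nullable(ders(s.toList, r))
 \end{lstlisting}
 \noindent The point of the \texttt{NTIMES} clause in \texttt{der} is
 that the counter is decremented instead of unfolding $r^{\{n\}}$ into
 $n$ copies of $r$, so the regular expression stays small. Note that
 this sketch does not include any simplification.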
 \noindent Now we are talking business! The modified matcher can within
 25 seconds handle regular expressions up to $n = 1,100$ before a
 StackOverflow is raised. Recall that Python and Ruby (and our first
 version, Scala V1) could only handle $n = 27$ or so in 30
-seconds. There is no change for our second example $(a^*)^* \cdot
+seconds. We have not tried our algorithm on the second example $(a^*)^* \cdot
-b$---so this is still good.
+b$---but it is doing OK with it.
 The moral is that our algorithm is rather sensitive to the
 size of regular expressions it needs to handle. This is of
 course obvious because both $\textit{nullable}$ and $\textit{der}$ frequently
 $\textit{der}\,c\,r = ((\ZERO \cdot b) + \ZERO)\cdot r$
 \end{tabular}
 \end{center}
 \noindent
-If we simplify them according to the simple rules from the
+If we simplify them according to the simplification rules from the
-beginning, we can replace the right-hand sides by the
+beginning, we can replace the right-hand sides by the smaller
-smaller equivalent regular expressions
+equivalent regular expressions
 \begin{center}
 \begin{tabular}{l}
 $\textit{der}\,a\,r \equiv b \cdot r$\\
 $\textit{der}\,b\,r \equiv r$\\
 string of 28 \texttt{a}s and the regular expression $a^{?\{n\}} \cdot
 a^{\{n\}}$.  We need a third of this time to do the same with strings
 up to 11,000 \texttt{a}s.  Similarly, Java and Python needed 30
 seconds to find out the regular expression $(a^*)^* \cdot b$ does not
 match the string of 28 \texttt{a}s. We can do the same
-for strings of 6,000,000 \texttt{a}s:
+for strings composed of nearly 6,000,000 \texttt{a}s:
 \begin{center}
 \begin{tikzpicture}
 \begin{axis}[
 ylabel={time in secs},
 enlargelimits=false,
 ymax=35,
 ytick={0,5,...,30},
 axis lines=left,
-scaled ticks=false,
+%scaled ticks=false,
 x label style={at={(1.09,0.0)}},
 %xmax=7700000,
 width=9cm,
 height=5cm,
 legend entries={Scala V3},
 \end{tikzpicture}
 \end{center}
 \subsection*{Epilogue}
-(23/Aug/2016) I recently found another place where this algorithm can be
+(23/Aug/2016) I recently found another place where this algorithm can
-sped up (this idea is not integrated with what is coming next,
+be sped up (this idea is not integrated with what is coming next, but
-but I present it nonetheless). The idea is to define \texttt{ders}
+I present it nonetheless). The idea is to define \texttt{ders} so
-not such that it iterates the derivative character-by-character, but
+that it iterates the derivative not character-by-character, but in bigger
-in bigger chunks. The resulting code for \texttt{ders2} looks as
+chunks. The resulting code for \texttt{ders2} looks as follows:
-follows:
 \lstinputlisting[numbers=none]{../progs/app52.scala}
 \noindent
 I have not fully understood why this version is much faster,
 xmax=7100000,
 ytick={0,5,...,30},
 ymax=33,
 %scaled ticks=false,
 axis lines=left,
-width=5.5cm,
+width=5.3cm,
 height=5cm,
 legend entries={Scala V3, Scala V4},
 legend style={at={(0.1,-0.2)},anchor=north}]
 \addplot[black,mark=square*,mark options={fill=white}] table {re3.data};
 \addplot[purple,mark=square*,mark options={fill=white}] table {re4.data};
 title={Graph: $(a^*)^* \cdot b$ and strings $\underbrace{a\ldots a}_{n}$},
 xlabel={$n$},
 x label style={at={(1.09,0.0)}},
 ylabel={time in secs},
 enlargelimits=false,
-xmax=8100000,
+xmax=8200000,
 ytick={0,5,...,30},
 ymax=33,
 %scaled ticks=false,
 axis lines=left,
-width=5.5cm,
+width=5.3cm,
 height=5cm,
 legend entries={Scala V3, Scala V4},
 legend style={at={(0.1,-0.2)},anchor=north}]
 \addplot[black,mark=square*,mark options={fill=white}] table {re3a.data};
 \addplot[purple,mark=square*,mark options={fill=white}] table {re4a.data};
 \section*{Proofs}
 You might not like doing proofs. But they serve a very
 important purpose in Computer Science: How can we be sure that
-our algorithm matches its specification. We can try to test
+our algorithm matches its specification? We can try to test
 the algorithm, but that often overlooks corner cases, and
 exhaustive testing is impossible (since there are infinitely
 many inputs). Proofs allow us to ensure that an algorithm
 really meets its specification.
 & $\mid$ & $r_1 \cdot r_2$      & sequence\\
 & $\mid$ & $r^*$                & star (zero or more)\\
 \end{tabular}
 \end{center}
-\noindent If you want to show a property $P(r)$ for all
+\noindent If you want to show a property $P(r)$ for \emph{all}
 regular expressions $r$, then you have to follow essentially
 the recipe:
 \begin{itemize}
 \item $P$ has to hold for $\ZERO$, $\ONE$ and $c$
 \textit{nullable}(r_1 + r_2) \;\;\text{if and only if}\;\;
 []\in L(r_1 + r_2)
 \label{propalt}
 \end{equation}
-\noindent The difference to the base cases is that in this
+\noindent The difference to the base cases is that in the inductive
-case we can already assume we proved
+cases we can already assume we proved $P$ for the components, that is,
+we can assume
 \begin{center}
 \begin{tabular}{l}
 $\textit{nullable}(r_1) \;\;\text{if and only if}\;\; []\in L(r_1)$ and\\
 $\textit{nullable}(r_2) \;\;\text{if and only if}\;\; []\in L(r_2)$\\
 \end{tabular}
 \end{center}
-\noindent These are the induction hypotheses. To check this
+\noindent These are called the induction hypotheses. To check this
 case, we can start from $\textit{nullable}(r_1 + r_2)$, which by
-definition is
+definition of $\textit{nullable}$ is
 \[
 \textit{nullable}(r_1) \vee \textit{nullable}(r_2)
 \]
 \[
 [] \in L(r_1)\cup L(r_2)
 \]
-\noindent but this is by definition of $L$ exactly $[] \in
+\noindent but this is by definition of $L$ exactly $[] \in L(r_1 +
-L(r_1 + r_2)$, which we needed to establish according to
+r_2)$, which we needed to establish according to the statement in
 \eqref{propalt}. What we have shown is that starting from
 $\textit{nullable}(r_1 + r_2)$ we have done equivalent transformations
-to end up with $[] \in L(r_1 + r_2)$. Consequently we have
+to end up with $[] \in L(r_1 + r_2)$. Consequently we have established
-established that $P(r_1 + r_2)$ holds.
+that $P(r_1 + r_2)$ holds.
 In order to complete the proof we would now need to look
 at the cases \mbox{$P(r_1\cdot r_2)$} and $P(r^*)$. Again I let you
 check the details.
-You might have to do induction proofs over strings.
+You might also have to do induction proofs over strings.
 That means you want to establish a property $P(s)$ for all
 strings $s$. For this remember strings are lists of
 characters. These lists can be either the empty list or a
 list of the form $c::s$. If you want to perform an induction
 proof for strings you need to consider the cases
 \[
 L(\textit{der}\,c\,r) = \textit{Der}\,c\,(L(r))
 \]
-\noindent holds (this would be of course a property that
+\noindent holds (this would be of course another property that needs
-needs to be proved in a side-lemma by induction on $r$).
+to be proved in a side-lemma by induction on $r$). This is a bit
+more challenging, but not impossible.
 To sum up, using reasoning like the one shown above allows us
 to show the correctness of our algorithm. To see this,
 start from the specification
 \begin{equation}
 [] \in \textit{Ders}\,s\,(L(r))
 \label{dersstep}
 \end{equation}
-\noindent But we have shown above in \eqref{dersprop}, that
+\noindent You agree?  But we have shown above in \eqref{dersprop},
-the $\textit{Ders}$ can be replaced by $L(\textit{ders}\ldots)$. That means
+that the $\textit{Ders}$ can be replaced by
-\eqref{dersstep} is equivalent to
+$L(\textit{ders}\ldots)$. That means \eqref{dersstep} is equivalent to
 \begin{equation}
 [] \in L(\textit{ders}\,s\,r)
 \label{prefinalstep}
 \end{equation}
 \[
 \textit{matches}\,s\,r\;\;\text{if and only if}\;\;
 s\in L(r)
 \]
-\noindent which is the property we set out to prove:
+\noindent which is the property we set out to prove: our algorithm
-our algorithm meets its specification. To have done
+meets its specification. Doing so requires a few induction
-so, requires a few induction proofs about strings and
+proofs about strings and regular expressions. Following the \emph{induction
-regular expressions. Following the recipes is already a big
+recipes} is already a big step in actually performing these proofs.
-step in performing these proofs.
+Believe it or not, proofs have helped me to make sure my code
+is correct and in several instances prevented me from letting
+embarrassing mistakes slip into the `wild'.
 \end{document}

changeset 488	598741d39d21
parent 481	acd8780bfc8b
child 492	39b7ff2cf1bc