afl-material: comparison handouts/ho02.tex

-:c235e0aeb8df
+:539b2e88f5b9
 This lecture is about implementing a more efficient regular
 expression matcher (the plots on the right)---more efficient
 than the matchers from regular expression libraries in Ruby
 and Python (the plots on the left). These plots show the
 running time for the evil regular expression
-$a?^{\{n\}}a^{\{n\}}$ and string composed of $n$ \pcode{a}s.
+${a^?}^{\{n\}}\cdot a^{\{n\}}$ and strings composed of $n$ \pcode{a}s.
 We will use this regular expression and these strings as a running
 example. To see the substantial differences in the two plots
 below, note the different scales of the $x$-axes.
 easy to verify since $L(\epsilon) = \{[]\}$ and appending the
 empty string to every string of another set leaves the set
 unchanged. Be careful to fully comprehend the fifth and sixth
 equivalence: if you concatenate two sets of strings and one is
 the empty set, then the concatenation will also be the empty
-set. To see this, check the definition of \pcode{_ @ _}. The
+set. To see this, check the definition of $\_ @ \_$. The
 last equivalence is again trivial.
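 \noindent As a small illustration of the fifth and sixth equivalence
 (a sketch, assuming the usual definition $A \,@\, B = \{s_1 @ s_2 \mid
 s_1 \in A \wedge s_2 \in B\}$, which is not repeated in this section),
 take $A = \{a, ab\}$:
 \[
 A \,@\, \{\} = \{s_1 @ s_2 \mid s_1 \in \{a, ab\} \wedge s_2 \in \{\}\} = \{\}
 \]
 \noindent There is no string $s_2$ to choose from the empty set, so no
 concatenation can be formed. By contrast, $A \,@\, \{[]\} = \{a, ab\}$,
 since appending the empty string changes nothing.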
 What will be important later on is that we can orient these
 equivalences and read them from left to right. In this way we
 can view them as \emph{simplification rules}. Consider for
 \noindent If we can find an equivalent regular expression that
 is simpler (smaller for example), then this might potentially
 make our matching algorithm run faster. The reason is that
 whether a string $s$ is in $L(r)$ or in $L(r')$ with $r\equiv
 r'$ will always give the same answer. In the example above you
-will see that the regular expression is equivalent to $r_1$.
+will see that the regular expression is equivalent to just $r_1$.
 You can verify this by iteratively applying the simplification
 rules from above:
 \begin{center}
 \begin{tabular}{ll}
 \noindent This operation essentially transforms a set of
 strings $A$ by filtering out all strings that do not start
 with $c$ and then strips off the $c$ from all the remaining
 strings. For example suppose $A = \{f\!oo, bar, f\!rak\}$ then
-\[ Der\,f\,A = \{oo, rak\}\quad,\quad Der\,b\,A = \{ar\} \quad
-\text{and} \quad Der\,a\,A = \varnothing \]
+\[ Der\,f\,A = \{oo, rak\}\quad,\quad
+Der\,b\,A = \{ar\} \quad \text{and} \quad
+Der\,a\,A = \{\}
+\]
 \noindent
 Note that in the last case $Der$ is empty, because no string in $A$
 starts with $a$. With this operation we can state the following
 property about $der$:
 matching and recursion allow us to mimic the mathematical
 definitions very closely.\label{scala1}}
 \end{figure}
 For running the algorithm with our favourite example, the evil
-regular expression $a?^{\{n\}}a^{\{n\}}$, we need to implement
+regular expression ${a^?}^{\{n\}}a^{\{n\}}$, we need to implement
 the optional regular expression and the exactly $n$-times
 regular expression. This can be done with the translations
 \lstinputlisting[numbers=none]{../progs/app51.scala}
 \end{center}
 \noindent Our algorithm traverses such regular expressions at
 least once every time a derivative is calculated. So having
 large regular expressions will cause problems. This problem
-is aggravated by $a?$ being represented as $a + \epsilon$.
+is aggravated by $a^?$ being represented as $a + \epsilon$.
 We can however fix this by having an explicit constructor for
 $r^{\{n\}}$. In Scala we would introduce a constructor like
 \begin{center}
 \end{tikzpicture}
 \end{center}
 \noindent Now we are talking business! The modified matcher
 can within 30 seconds handle regular expressions up to
-$n = 950$ before a StackOverflow is raised.
+$n = 950$ before a StackOverflow is raised. Python and Ruby
+(and our first version) could only handle $n = 27$ or so in 30
+seconds.
 The moral is that our algorithm is rather sensitive to the
 size of regular expressions it needs to handle. This is of
 course obvious because both $nullable$ and $der$ frequently
 need to traverse the whole regular expression. There seems,
 \begin{figure}[p]
 \lstinputlisting{../progs/app6.scala}
 \caption{The simplification function and modified
 \texttt{ders}-function; this function now
 calls \texttt{der} first, but then simplifies
-the resulting derivative regular expressions, see
+the resulting derivative regular expressions before
+building the next derivative, see
 Line~28.\label{scala2}}
 \end{figure}
 \begin{center}
 \begin{tikzpicture}
 xmax=12000,
 ytick={0,5,...,30},
 scaled ticks=false,
 axis lines=left,
 width=9cm,
-height=4cm,
+height=5cm,
 legend entries={Scala V2,Scala V3}]
 \addplot[green,mark=square*,mark options={fill=white}] table {re2b.data};
 \addplot[black,mark=square*,mark options={fill=white}] table {re3.data};
 \end{axis}
 \end{tikzpicture}
 \end{center}
 \section*{Proofs}
 You might not like doing proofs. But they serve a very
-important purpose in Computer Science: When can we be sure
+important purpose in Computer Science: How can we be sure that
-that our algorithms match their specification. We can try to
+our algorithm matches its specification? We can try to test
-test the algorithms, but that often overlooks corner cases and
+the algorithm, but that often overlooks corner cases and
-also often an exhaustive testing is impossible (since there
+exhaustive testing is impossible (since there are infinitely
-are often infinitely many inputs). Proofs allow us to ensure
+many inputs). Proofs allow us to ensure that an algorithm
-that an algorithm meets its specification.
+really meets its specification.
 For the programs we look at in this module, the proofs will
 mostly be by some form of induction. Remember that regular
 expressions are defined as
 \noindent
 A simple proof is for example showing the following
 property:
-\[
+\begin{equation}
 nullable(r) \;\;\text{if and only if}\;\; []\in L(r)
-\]
+\label{nullableprop}
+\end{equation}
 \noindent
 Let us say that this property is $P(r)$, then the first case
 we need to check is whether $P(\varnothing)$ holds (see recipe
 above). So we have to show that
 nullable(\varnothing) \;\;\text{if and only if}\;\;
 []\in L(\varnothing)
 \]
 \noindent whereby $nullable(\varnothing)$ is by definition of
-the function $nullable$ always $\textit{false}$. We also
+the function $nullable$ always $\textit{false}$. We also have
-have that $L(\varnothing)$ is by definition $\{\}$. It is
+that $L(\varnothing)$ is by definition $\{\}$. It is
 impossible that the empty string $[]$ is in the empty set.
 Therefore also the right-hand side is false. Consequently we
-verified this case. We would still need to do this for
+verified this case: both sides are false. We would still need
-$P(\varepsilon)$ and $P(c)$. I leave this to you to check.
+to do this for $P(\varepsilon)$ and $P(c)$. I leave this to
+you to verify.
 Next we need to check the inductive cases, for example
 $P(r_1 + r_2)$, which is
-\[
+\begin{equation}
 nullable(r_1 + r_2) \;\;\text{if and only if}\;\;
 []\in L(r_1 + r_2)
-\]
+\label{propalt}
+\end{equation}
 \noindent The difference from the base cases is that in this
 case we can already assume we proved
 \begin{center}
 \begin{tabular}{l}
-$nullable(r_1) \;\;\text{if and only if}\;\; []\in L(r_1)$\\
+$nullable(r_1) \;\;\text{if and only if}\;\; []\in L(r_1)$ and\\
 $nullable(r_2) \;\;\text{if and only if}\;\; []\in L(r_2)$\\
 \end{tabular}
 \end{center}
 \noindent These are the induction hypotheses. To check this
-case we can start from $nullable(r_1 + r_2)$, which by
+case, we can start from $nullable(r_1 + r_2)$, which by
 definition is
 \[
 nullable(r_1) \vee nullable(r_2)
 \]
-\noindent Using the induction hypotheses we can transform
+\noindent Using the two induction hypotheses from above,
-this into
+we can transform this into
 \[
 [] \in L(r_1) \vee []\in L(r_2)
 \]
 \noindent We just replaced the $nullable(\ldots)$ parts by
 the equivalent $[] \in L(\ldots)$ from the induction
 hypotheses. A bit of thinking convinces you that if
-$[] \in L(r_1) \vee []\in(r_2)$ then the empty string
+$[] \in L(r_1) \vee []\in L(r_2)$ then the empty string
 must be in the union $L(r_1)\cup L(r_2)$, that is
 \[
 [] \in L(r_1)\cup L(r_2)
 \]
-\noindent but this is by definition of $L$ exactly
+\noindent but this is by definition of $L$ exactly $[] \in
-$[] \in L(r_1 + r_2)$, which we needed to establish.
+L(r_1 + r_2)$, which we needed to establish according to
-What we have shown is that starting from
+\eqref{propalt}. What we have shown is that starting from
 $nullable(r_1 + r_2)$ we have done equivalent transformations
 to end up with $[] \in L(r_1 + r_2)$. Consequently we have
 established that $P(r_1 + r_2)$ holds.
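 \noindent For a concrete instance (chosen here only for illustration,
 and assuming $L(c) = \{[c]\}$ for characters), take $r_1 = a$ and
 $r_2 = \epsilon$. Then
 \[
 nullable(a + \epsilon) = nullable(a) \vee nullable(\epsilon)
 = \textit{false} \vee \textit{true} = \textit{true}
 \]
 \noindent and on the other side
 \[
 L(a + \epsilon) = L(a) \cup L(\epsilon) = \{[a]\} \cup \{[]\} = \{[a], []\}
 \]
 \noindent which indeed contains $[]$, so both sides of the equivalence agree.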
 In order to complete the proof we would now need to look
-at the cases $P(r_1\cdot r_2)$ and $P(r^*)$. Again I let you
+at the cases \mbox{$P(r_1\cdot r_2)$} and $P(r^*)$. Again I let you
 check the details.
-Finally, you might have to do induction proofs over strings.
+You might have to do induction proofs over strings.
 That means you want to establish a property $P(s)$ for all
 strings $s$. For this remember strings are lists of
 characters. These lists can be either the empty list or a
 list of the form $c::s$. If you want to perform an induction
 proof for strings you need to consider the cases
 \end{itemize}
 \noindent
 Given this recipe, I let you show
-\[
+\begin{equation}
 Ders\,s\,(L(r)) = L(ders\,s\,r)
-\]
+\label{dersprop}
+\end{equation}
-\noindent by induction on $s$. In this proof you can
-assume the property for $der$ and $Der$ has already been
+\noindent by induction on $s$. In this proof you can assume
+the following property for $der$ and $Der$ has already been
 proved, that is you can assume
 \[
 L(der\,c\,r) = Der\,c\,(L(r))
 \]
 \noindent holds (this would be of course a property that
 needs to be proved in a side-lemma by induction on $r$).
+To sum up, using reasoning like the one shown above allows us
+to show the correctness of our algorithm. To see this,
+start from the specification
+\[
+s \in L(r)
+\]
+\noindent That is the problem we want to solve. Thinking a
+little, you will see that this problem is equivalent to the
+following problem
+\begin{equation}
+[] \in Ders\,s\,(L(r))
+\label{dersstep}
+\end{equation}
+\noindent But we have shown above in \eqref{dersprop} that
+the $Ders$ can be replaced by $L(ders\ldots)$. That means
+\eqref{dersstep} is equivalent to
+\begin{equation}
+[] \in L(ders\,s\,r)
+\label{prefinalstep}
+\end{equation}
+\noindent We have also shown that testing whether the empty
+string is in a language is equivalent to the $nullable$
+function; see \eqref{nullableprop}. That means
+\eqref{prefinalstep} is equivalent to
+\[
+nullable(ders\,s\,r)
+\]
+\noindent But this is just the definition of $matches$
+\[
+matches\,s\,r \dn nullable(ders\,s\,r)
+\]
+\noindent In effect we have shown
+\[
+matches\,s\,r\;\;\text{if and only if}\;\;
+s\in L(r)
+\]
+\noindent which is the property we set out to prove:
+our algorithm meets its specification. Doing
+so requires a few induction proofs about strings and
+regular expressions. Following the recipes is already a big
+step in performing these proofs.
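 \noindent To see this chain of equivalences at work, consider a small
 example chosen for illustration: $r = a\cdot b$ and $s = ab$. The
 calculation assumes the usual clauses for $der$, $ders$ and $nullable$,
 which are not repeated in this section; in particular
 $der\,c\,(r_1\cdot r_2)$ is $(der\,c\,r_1)\cdot r_2 + der\,c\,r_2$ if
 $nullable(r_1)$ holds, and $(der\,c\,r_1)\cdot r_2$ otherwise.
 \begin{center}
 \begin{tabular}{lcl}
 $der\,a\,(a\cdot b)$ & $=$ & $(der\,a\,a)\cdot b \;=\; \epsilon\cdot b$\\
 $der\,b\,(\epsilon\cdot b)$ & $=$ & $(der\,b\,\epsilon)\cdot b + der\,b\,b
 \;=\; \varnothing\cdot b + \epsilon$\\
 $nullable(\varnothing\cdot b + \epsilon)$ & $=$ &
 $(nullable(\varnothing) \wedge nullable(b)) \vee nullable(\epsilon)
 \;=\; \textit{true}$
 \end{tabular}
 \end{center}
 \noindent Hence $ders\,ab\,(a\cdot b) = \varnothing\cdot b + \epsilon$
 and $matches\,ab\,(a\cdot b)$ holds, which agrees with
 $ab \in L(a\cdot b) = \{ab\}$.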
 \end{document}

changeset 343	539b2e88f5b9
parent 340	c49122dbcdd1
child 394	2f9fe225ecc8