afl-material: comparison handouts/ho02.tex

equal deleted inserted replaced

-:b7221df9662a
+:065ca01b62ae
 match the strings).  To see the substantial differences in the left
 and right plots below, note the different scales of the $x$-axes.
 \begin{center}
+Plots: $a^{?\{n\}} \cdot a^{\{n\}}$ and strings $\underbrace{a\ldots a}_{n}$\\
 \begin{tabular}{@{}cc@{}}
 \begin{tikzpicture}
 \begin{axis}[
-xlabel={\small $a^{?\{n\}} \cdot a^{\{n\}}$ and strings $\underbrace{a\ldots a}_{n}$},
+xlabel={$n$},
+x label style={at={(1.05,0.0)}},
 ylabel={\small time in secs},
 enlargelimits=false,
 xtick={0,5,...,30},
 xmax=33,
 ymax=35,
 \end{axis}
 \end{tikzpicture}
 &
 \begin{tikzpicture}
 \begin{axis}[
-xlabel={\small $a^{?\{n\}} \cdot a^{\{n\}}$ and strings $\underbrace{a\ldots a}_{n}$},
+xlabel={$n$},
+x label style={at={(1.1,0.05)}},
 ylabel={\small time in secs},
 enlargelimits=false,
 xtick={0,3000,...,12000},
 xmax=12500,
 ymax=35,
 \end{tikzpicture}
 \end{tabular}
 \end{center}
 \begin{center}
+Plots: $(a^*)^* \cdot b$ and strings $\underbrace{a\ldots a}_{n}$
 \begin{tabular}{@{}cc@{}}
 \begin{tikzpicture}
 \begin{axis}[
-xlabel={$(a^*)^* \cdot b$ and strings $\underbrace{a\ldots a}_{n}$},
+xlabel={$n$},
+x label style={at={(1.05,0.0)}},
 ylabel={time in secs},
 enlargelimits=false,
 xtick={0,5,...,30},
 xmax=33,
 ymax=35,
 \end{axis}
 \end{tikzpicture}
 &
 \begin{tikzpicture}
 \begin{axis}[
-xlabel={$(a^*)^* \cdot b$ and strings $\underbrace{a\ldots a}_{n}$},
+xlabel={$n$},
+x label style={at={(1.1,0.05)}},
 ylabel={time in secs},
 enlargelimits=false,
 xtick={0,3000,...,12000},
 xmax=12500,
 ymax=35,
 we cannot use the function $L$ directly for this, because in general
 the set of strings $L$ returns is infinite (recall what $L(a^*)$ is).
 In such cases there is no way we can implement an exhaustive test for
 whether a string is member of this set or not. In contrast our
 matching algorithm will operate on the regular expression $r$ and
-string $s$, only, which are both finite objects. Before we explain to
+string $s$, only, which are both finite objects. Before we explain
 the matching algorithm, however, let us have a closer look at what it
 means when two regular expressions are equivalent.
 \subsection*{Regular Expression Equivalences}
 expression look like that can match just $s$? The definition
 of this function is as follows:
 \begin{center}
 \begin{tabular}{l@ {\hspace{2mm}}c@ {\hspace{2mm}}l}
-$der\, c\, (\ZERO)$      & $\dn$ & $\ZERO$\\
+$\textit{der}\, c\, (\ZERO)$      & $\dn$ & $\ZERO$\\
-$der\, c\, (\ONE)$         & $\dn$ & $\ZERO$ \\
+$\textit{der}\, c\, (\ONE)$         & $\dn$ & $\ZERO$ \\
-$der\, c\, (d)$                & $\dn$ & if $c = d$ then $\ONE$ else $\ZERO$\\
+$\textit{der}\, c\, (d)$                & $\dn$ & if $c = d$ then $\ONE$ else $\ZERO$\\
-$der\, c\, (r_1 + r_2)$        & $\dn$ & $der\, c\, r_1 + der\, c\, r_2$\\
+$\textit{der}\, c\, (r_1 + r_2)$        & $\dn$ & $\textit{der}\, c\, r_1 + \textit{der}\, c\, r_2$\\
-$der\, c\, (r_1 \cdot r_2)$  & $\dn$  & if $\textit{nullable} (r_1)$\\
+$\textit{der}\, c\, (r_1 \cdot r_2)$  & $\dn$  & if $\textit{nullable} (r_1)$\\
-& & then $(der\,c\,r_1) \cdot r_2 + der\, c\, r_2$\\
+& & then $(\textit{der}\,c\,r_1) \cdot r_2 + \textit{der}\, c\, r_2$\\
-& & else $(der\, c\, r_1) \cdot r_2$\\
+& & else $(\textit{der}\, c\, r_1) \cdot r_2$\\
-$der\, c\, (r^*)$          & $\dn$ & $(der\,c\,r) \cdot (r^*)$
+$\textit{der}\, c\, (r^*)$          & $\dn$ & $(\textit{der}\,c\,r) \cdot (r^*)$
 \end{tabular}
 \end{center}
 \noindent The first two clauses can be rationalised as
-follows: recall that $der$ should calculate a regular
+follows: recall that $\textit{der}$ should calculate a regular
 expression so that given the ``input'' regular expression can
 match a string of the form $c\!::\!s$, we want a regular
 expression for $s$. Since neither $\ZERO$ nor $\ONE$
 can match a string of the form $c\!::\!s$, we return
 $\ZERO$. In the third case we have to make a
 return $\ZERO$ since no string of the $c\!::\!s$ can be
 matched. Next come the recursive cases, which are a bit more
 involved. Fortunately, the $+$-case is still relatively
 straightforward: all strings of the form $c\!::\!s$ are either
 matched by the regular expression $r_1$ or $r_2$. So we just
-have to recursively call $der$ with these two regular
+have to recursively call $\textit{der}$ with these two regular
 expressions and compose the results again with $+$. Makes
 sense?
 The $\cdot$-case is more complicated: if $r_1\cdot r_2$
 matches a string of the form $c\!::\!s$, then the first part
 must be matched by $r_1$. Consequently, it makes sense to
-construct the regular expression for $s$ by calling $der$ with
+construct the regular expression for $s$ by calling $\textit{der}$ with
 $r_1$ and ``appending'' $r_2$. There is however one exception
 to this simple rule: if $r_1$ can match the empty string, then
 all of $c\!::\!s$ is matched by $r_2$. So in case $r_1$ is
 nullable (that is can match the empty string) we have to allow
-the choice $der\,c\,r_2$ for calculating the regular
+the choice $\textit{der}\,c\,r_2$ for calculating the regular
 expression that can match $s$. Therefore we have to add the
-regular expression $der\,c\,r_2$ in the result. The $*$-case
+regular expression $\textit{der}\,c\,r_2$ in the result. The $*$-case
 is again simple: if $r^*$ matches a string of the form
 $c\!::\!s$, then the first part must be ``matched'' by a
-single copy of $r$. Therefore we call recursively $der\,c\,r$
+single copy of $r$. Therefore we call recursively $\textit{der}\,c\,r$
-and ``append'' $r^*$ in order to match the rest of $s$.
+and ``append'' $r^*$ in order to match the rest of $s$. Still
+makes sense?
-If this did not make sense yet, here is another way to rationalise
-the definition of $der$ by considering the following operation
+If all this did not make sense yet, here is another way to rationalise
+the definition of $\textit{der}$ by considering the following operation
 on sets:
 \begin{equation}\label{Der}
-Der\,c\,A\;\dn\;\{s\,|\,c\!::\!s \in A\}
+\textit{Der}\,c\,A\;\dn\;\{s\,|\,c\!::\!s \in A\}
 \end{equation}
 \noindent This operation essentially transforms a set of
 strings $A$ by filtering out all strings that do not start
 with $c$ and then strips off the $c$ from all the remaining
 strings. For example suppose $A = \{f\!oo, bar, f\!rak\}$ then
-\[ Der\,f\,A = \{oo, rak\}\quad,\quad
+\[ \textit{Der}\,f\,A = \{oo, rak\}\quad,\quad
-Der\,b\,A = \{ar\} \quad \text{and} \quad
+\textit{Der}\,b\,A = \{ar\} \quad \text{and} \quad
-Der\,a\,A = \{\}
+\textit{Der}\,a\,A = \{\}
 \]
 \noindent
-Note that in the last case $Der$ is empty, because no string in $A$
+Note that in the last case $\textit{Der}$ is empty, because no string in $A$
 starts with $a$. With this operation we can state the following
-property about $der$:
+property about $\textit{der}$:
 \[
-L(der\,c\,r) = Der\,c\,(L(r))
+L(\textit{der}\,c\,r) = \textit{Der}\,c\,(L(r))
 \]
 \noindent
-This property clarifies what regular expression $der$ calculates,
+This property clarifies what regular expression $\textit{der}$ calculates,
 namely take the set of strings that $r$ can match (that is $L(r)$),
 filter out all strings not starting with $c$ and strip off the $c$
 from the remaining strings---this is exactly the language that
-$der\,c\,r$ can match.
+$\textit{der}\,c\,r$ can match.
 If we want to find out whether the string $abc$ is matched by
-the regular expression $r_1$ then we can iteratively apply $der$
+the regular expression $r_1$ then we can iteratively apply $\textit{der}$
 as follows
 \begin{center}
 \begin{tabular}{rll}
 Input: $r_1$, $abc$\medskip\\
-Step 1: & build derivative of $a$ and $r_1$ & $(r_2 = der\,a\,r_1)$\smallskip\\
+Step 1: & build derivative of $a$ and $r_1$ & $(r_2 = \textit{der}\,a\,r_1)$\smallskip\\
-Step 2: & build derivative of $b$ and $r_2$ & $(r_3 = der\,b\,r_2)$\smallskip\\
+Step 2: & build derivative of $b$ and $r_2$ & $(r_3 = \textit{der}\,b\,r_2)$\smallskip\\
-Step 3: & build derivative of $c$ and $r_3$ & $(r_4 = der\,b\,r_3)$\smallskip\\
+Step 3: & build derivative of $c$ and $r_3$ & $(r_4 = \textit{der}\,b\,r_3)$\smallskip\\
 Step 4: & the string is exhausted; test & ($\textit{nullable}(r_4)$)\\
 & whether $r_4$ can recognise the\\
 & empty string\smallskip\\
 Output: & result of this test $\Rightarrow \textit{true} \,\text{or}\, \textit{false}$\\
 \end{tabular}
 \end{center}
-\noindent Again the operation $Der$ might help to rationalise
+\noindent Again the operation $\textit{Der}$ might help to rationalise
 this algorithm. We want to know whether $abc \in L(r_1)$. We
-do not know yet---but let us assume it is. Then $Der\,a\,L(r_1)$
+do not know yet---but let us assume it is. Then $\textit{Der}\,a\,L(r_1)$
 builds the set where all the strings not starting with $a$ are
 filtered out. Of the remaining strings, the $a$ is stripped
 off. So we should still have $bc$ in the set.
 Then we continue with filtering out all strings not
 starting with $b$ and stripping off the $b$ from the remaining
-strings, that means we build $Der\,b\,(Der\,a\,(L(r_1)))$.
+strings, that means we build $\textit{Der}\,b\,(\textit{Der}\,a\,(L(r_1)))$.
 Finally we filter out all strings not starting with $c$ and
 strip off $c$ from the remaining string. This is
-$Der\,c\,(Der\,b\,(Der\,a\,(L(r))))$. Now if $abc$ was in the
+$\textit{Der}\,c\,(\textit{Der}\,b\,(\textit{Der}\,a\,(L(r_1))))$. Now if $abc$ was in the
-original set ($L(r_1)$), then in $Der\,c\,(Der\,b\,(Der\,a\,(L(r))))$
+original set ($L(r_1)$), then $\textit{Der}\,c\,(\textit{Der}\,b\,(\textit{Der}\,a\,(L(r_1))))$
 must contain the empty string. If not, then $abc$ was not in the
 language we started with.
-Our matching algorithm using $der$ and $\textit{nullable}$ works
+Our matching algorithm using $\textit{der}$ and $\textit{nullable}$ works
-similarly, just using regular expression instead of sets. For
+similarly, just using regular expression instead of sets. In order to
-this we need to extend the notion of derivatives from single
+define our algorithm we need to extend the notion of derivatives from single
 characters to strings. This can be done using the following
-function, taking a string and regular expression as input and
+function, taking a string and a regular expression as input and
 a regular expression as output.
 \begin{center}
 \begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {\hspace{-10mm}}l@ {}}
 $\textit{ders}\, []\, r$     & $\dn$ & $r$ & \\
-$\textit{ders}\, (c\!::\!s)\, r$ & $\dn$ & $\textit{ders}\,s\,(der\,c\,r)$ & \\
+$\textit{ders}\, (c\!::\!s)\, r$ & $\dn$ & $\textit{ders}\,s\,(\textit{der}\,c\,r)$ & \\
 \end{tabular}
 \end{center}
-\noindent This function iterates $der$ taking one character at
+\noindent This function iterates $\textit{der}$ taking one character at
 the time from the original string until it is exhausted.
-Having $ders$ in place, we can finally define our matching
+Having $\textit{der}s$ in place, we can finally define our matching
 algorithm:
 \[
-matches\,s\,r \dn \textit{nullable}(ders\,s\,r)
+\textit{matches}\,s\,r \dn \textit{nullable}(\textit{ders}\,s\,r)
 \]
 \noindent
 and we can claim that
 \[
-matches\,s\,r\quad\text{if and only if}\quad s\in L(r)
+\textit{matches}\,s\,r\quad\text{if and only if}\quad s\in L(r)
 \]
 \noindent holds, which means our algorithm satisfies the
 specification. Of course we can claim many things\ldots
 whether the claim holds any water is a different question,
 which for example is the point of the Strand-2 Coursework.
-This algorithm was introduced by Janus Brzozowski in 1964. Its
+This algorithm was introduced by Janus Brzozowski in 1964, but
+is more widely known only in the last 10 or so years. Its
 main attractions are simplicity and being fast, as well as
 being easily extendable for other regular expressions such as
 $r^{\{n\}}$, $r^?$, $\sim{}r$ and so on (this is subject of
 Strand-1 Coursework 1).
 \subsection*{The Matching Algorithm in Scala}
 Another attraction of the algorithm is that it can be easily
 implemented in a functional programming language, like Scala.
 implement in functional languages, because their built-in pattern
 matching and recursion allow us to mimic the mathematical
 definitions very closely.\label{scala1}}
 \end{figure}
-For running the algorithm with our favourite example, the evil
+Remember our second example involving the regular expression
+$(a^*)^* \cdot b$ which could not match strings of $n$ \texttt{a}s.
+Java needed around 30 seconds to find this out a string with $n=28$.
+It seems our algorithm is doing rather well in comparison:
+\begin{center}
+\begin{tikzpicture}
+\begin{axis}[
+title={Plot: $(a^*)^* \cdot b$ and strings $\underbrace{a\ldots a}_{n}$},
+xlabel={$n$},
+x label style={at={(1.05,0.0)}},
+ylabel={time in secs},
+enlargelimits=false,
+xtick={0,500,...,4000},
+xmax=4100,
+ytick={0,5,...,30},
+scaled ticks=false,
+axis lines=left,
+width=10cm,
+height=4.5cm,
+legend entries={Java,Scala V1},
+legend pos=north east,
+legend cell align=left]
+\addplot[blue,mark=*, mark options={fill=white}]
+table {re-java.data};
+\addplot[red,mark=triangle*,mark options={fill=white}]
+table {re2c.data};
+\end{axis}
+\end{tikzpicture}
+\end{center}
+\noindent
+This is no error: it hardly takes more than half a second for
+strings up to the length of 4000. After that we receive a
+StackOverflow exception, but still\ldots
+For running the algorithm with our first example, the evil
 regular expression $a^?{}^{\{n\}}a^{\{n\}}$, we need to implement
 the optional regular expression and the exactly $n$-times
 regular expression. This can be done with the translations
 \lstinputlisting[numbers=none]{../progs/app51.scala}
-\noindent Running the matcher with the example, we find it is
+\noindent Running the matcher with this example, we find it is
 slightly worse then the matcher in Ruby and Python.
 Ooops\ldots
 \begin{center}
 \begin{tikzpicture}
 \begin{axis}[
-xlabel={\pcode{a}s},
+title={Plot: $a^{?\{n\}} \cdot a^{\{n\}}$ and strings $\underbrace{a\ldots a}_{n}$},
+xlabel={$n$},
+x label style={at={(1.05,0.0)}},
 ylabel={time in secs},
 enlargelimits=false,
 xtick={0,5,...,30},
 xmax=30,
 ytick={0,5,...,30},
 \end{center}
 \noindent With this fix we have a constant ``size'' regular
 expression for our running example no matter how large $n$ is.
 This means we have to also add cases for \pcode{NTIMES} in the
-functions $\textit{nullable}$ and $der$. Does the change have any
+functions $\textit{nullable}$ and $\textit{der}$. Does the change have any
 effect?
 \begin{center}
 \begin{tikzpicture}
 \begin{axis}[
-xlabel={\pcode{a}s},
+title={Plot: $a^{?\{n\}} \cdot a^{\{n\}}$ and strings $\underbrace{a\ldots a}_{n}$},
+xlabel={$n$},
+x label style={at={(1.01,0.0)}},
 ylabel={time in secs},
 enlargelimits=false,
 xtick={0,100,...,1000},
-xmax=1000,
+xmax=1100,
 ytick={0,5,...,30},
 scaled ticks=false,
 axis lines=left,
-width=9.5cm,
+width=10cm,
 height=5cm,
 legend entries={Python,Ruby,Scala V1,Scala V2},
 legend pos=outer north east,
-legend cell align=left
+legend cell align=left]
-]
 \addplot[blue,mark=*, mark options={fill=white}]
 table {re-python.data};
 \addplot[brown,mark=pentagon*, mark options={fill=white}]
 table {re-ruby.data};
 \addplot[red,mark=triangle*,mark options={fill=white}]
 \end{tikzpicture}
 \end{center}
 \noindent Now we are talking business! The modified matcher
 can within 30 seconds handle regular expressions up to
-$n = 950$ before a StackOverflow is raised. Python and Ruby
+$n = 950$ before a StackOverflow is raised. Recall that Python and Ruby
-(and our first version) could only handle $n = 27$ or so in 30
+(and our first version, Scala V1) could only handle $n = 27$ or so in 30
-second.
+seconds. There is no change for our second example
+$(a^*)^* \cdot b$---so this is still good.
-SECOND EXAMPLE
 The moral is that our algorithm is rather sensitive to the
 size of regular expressions it needs to handle. This is of
-course obvious because both $\textit{nullable}$ and $der$ frequently
+course obvious because both $\textit{nullable}$ and $\textit{der}$ frequently
 need to traverse the whole regular expression. There seems,
 however, one more issue for making the algorithm run faster.
 The derivative function often produces ``useless''
 $\ZERO$s and $\ONE$s. To see this, consider $r = ((a
 \cdot b) + b)^*$ and the following two derivatives
 \begin{center}
 \begin{tabular}{l}
-$der\,a\,r = ((\ONE \cdot b) + \ZERO) \cdot r$\\
+$\textit{der}\,a\,r = ((\ONE \cdot b) + \ZERO) \cdot r$\\
-$der\,b\,r = ((\ZERO \cdot b) + \ONE)\cdot r$\\
+$\textit{der}\,b\,r = ((\ZERO \cdot b) + \ONE)\cdot r$\\
-$der\,c\,r = ((\ZERO \cdot b) + \ZERO)\cdot r$
+$\textit{der}\,c\,r = ((\ZERO \cdot b) + \ZERO)\cdot r$
 \end{tabular}
 \end{center}
 \noindent
 If we simplify them according to the simple rules from the
 beginning, we can replace the right-hand sides by the
 smaller equivalent regular expressions
 \begin{center}
 \begin{tabular}{l}
-$der\,a\,r \equiv b \cdot r$\\
+$\textit{der}\,a\,r \equiv b \cdot r$\\
-$der\,b\,r \equiv r$\\
+$\textit{der}\,b\,r \equiv r$\\
-$der\,c\,r \equiv \ZERO$
+$\textit{der}\,c\,r \equiv \ZERO$
 \end{tabular}
 \end{center}
 \noindent I leave it to you to contemplate whether such a
 simplification can have any impact on the correctness of our
 Line~\ref{simpline}.\label{scala2}}
 \end{figure}
 \begin{center}
 \begin{tikzpicture}
-\begin{axis}[xlabel={\pcode{a}s},ylabel={time in secs},
+\begin{axis}[
+title={Plot: $a^{?\{n\}} \cdot a^{\{n\}}$ and strings $\underbrace{a\ldots a}_{n}$},
+xlabel={$n$},
+x label style={at={(1.04,0.0)}},
+ylabel={time in secs},
 enlargelimits=false,
 xtick={0,2000,...,12000},
-xmax=12000,
+xmax=13000,
 ytick={0,5,...,30},
 scaled ticks=false,
 axis lines=left,
 width=9cm,
 height=5cm,
 \addplot[black,mark=square*,mark options={fill=white}] table {re3.data};
 \end{axis}
 \end{tikzpicture}
 \end{center}
-SECOND EXAMPLE
+\begin{center}
+\begin{tikzpicture}
+\begin{axis}[
+title={Plot: $(a^*)^* \cdot b$ and strings $\underbrace{a\ldots a}_{n}$},
+xlabel={$n$},
+x label style={at={(1.09,0.0)}},
+ylabel={time in secs},
+enlargelimits=false,
+%xtick={0,1e+6,...,6e+6},
+xmax=6200000,
+ytick={0,5,...,30},
+ymax=33,
+scaled ticks=false,
+axis lines=left,
+width=9cm,
+height=5cm,
+legend entries={Scala V3}]
+\addplot[black,mark=square*,mark options={fill=white}] table {re3a.data};
+\end{axis}
+\end{tikzpicture}
+\end{center}
+\section{Epilogue}
+TBD
 \section*{Proofs}
 You might not like doing proofs. But they serve a very
 important purpose in Computer Science: How can we be sure that
 \noindent
 Given this recipe, I let you show
 \begin{equation}
-Ders\,s\,(L(r)) = L(ders\,s\,r)
+\textit{Ders}\,s\,(L(r)) = L(\textit{ders}\,s\,r)
 \label{dersprop}
 \end{equation}
-\noindent by induction on $s$. Recall $Der$ is defined for
+\noindent by induction on $s$. Recall $\textit{Der}$ is defined for
-character---see \eqref{Der}; $Ders$ is similar, but for strings:
+character---see \eqref{Der}; $\textit{Ders}$ is similar, but for strings:
 \[
-Ders\,s\,A\;\dn\;\{s'\,|\,s @ s' \in A\}
+\textit{Ders}\,s\,A\;\dn\;\{s'\,|\,s @ s' \in A\}
 \]
 \noindent In this proof you can assume the following property
-for $der$ and $Der$ has already been proved, that is you can
+for $der$ and $\textit{Der}$ has already been proved, that is you can
 assume
 \[
-L(der\,c\,r) = Der\,c\,(L(r))
+L(\textit{der}\,c\,r) = \textit{Der}\,c\,(L(r))
 \]
 \noindent holds (this would be of course a property that
 needs to be proved in a side-lemma by induction on $r$).
 \noindent That is the problem we want to solve. Thinking a
 little, you will see that this problem is equivalent to the
 following problem
 \begin{equation}
-[] \in Ders\,s\,(L(r))
+[] \in \textit{Ders}\,s\,(L(r))
 \label{dersstep}
 \end{equation}
 \noindent But we have shown above in \eqref{dersprop}, that
-the $Ders$ can be replaced by $L(ders\ldots)$. That means
+the $\textit{Ders}$ can be replaced by $L(\textit{ders}\ldots)$. That means
 \eqref{dersstep} is equivalent to
 \begin{equation}
-[] \in L(ders\,s\,r)
+[] \in L(\textit{ders}\,s\,r)
 \label{prefinalstep}
 \end{equation}
 \noindent We have also shown that testing whether the empty
 string is in a language is equivalent to the $\textit{nullable}$
 function; see \eqref{nullableprop}. That means
 \eqref{prefinalstep} is equivalent with
 \[
-\textit{nullable}(ders\,s\,r)
+\textit{nullable}(\textit{ders}\,s\,r)
 \]
 \noindent But this is just the definition of $matches$
 \[
-matches\,s\,r \dn nullable(ders\,s\,r)
+matches\,s\,r \dn nullable(\textit{ders}\,s\,r)
 \]
 \noindent In effect we have shown
 \[

changeset 414	065ca01b62ae
parent 412	1cef3924f7a2
child 415	4ae59fd3b174