afl-material: comparison handouts/ho02.tex

equal deleted inserted replaced

-:1aec0e1fda86
+:1cef3924f7a2
 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016}
 \section*{Handout 2 (Regular Expression Matching)}
-This lecture is about implementing a more efficient regular
+This lecture is about implementing a more efficient regular expression
-expression matcher (the plots on the right)---more efficient
+matcher (the plots on the right)---more efficient than the matchers
-than the matchers from regular expression libraries in Ruby
+from regular expression libraries in Ruby, Python and Java (the plots
-and Python (the plots on the left). These plots show the
+on the left). The first pair of plots show the running time for the
-running time for the evil regular expression
+regular expressions $a^?{}^{\{n\}}\cdot a^{\{n\}}$ and strings composed
-$a^?{}^{\{n\}}\cdot a^{\{n\}}$ and strings composed of $n$
+of $n$ \pcode{a}s. The second pair of plots show the running time
-\pcode{a}s. We will use this regular expression and strings as
+for the regular expression $(a^*)^*\cdot b$ and also strings composed
-running example. To see the substantial differences in the two
+of $n$ \pcode{a}s (meaning this regular expression actually does not
-plots below, note the different scales of the $x$-axes.
+match the strings).  To see the substantial differences in the left
+and right plots below, note the different scales of the $x$-axes.
 \begin{center}
 \begin{tabular}{@{}cc@{}}
 \begin{tikzpicture}
 \begin{axis}[
-xlabel={strings of {\tt a}s},
+xlabel={\small $a^{?\{n\}} \cdot a^{\{n\}}$ and strings $\underbrace{a\ldots a}_{n}$},
-ylabel={time in secs},
+ylabel={\small time in secs},
 enlargelimits=false,
 xtick={0,5,...,30},
 xmax=33,
 ymax=35,
 ytick={0,5,...,30},
 \end{axis}
 \end{tikzpicture}
 &
 \begin{tikzpicture}
 \begin{axis}[
-xlabel={strings of \texttt{a}s},
+xlabel={\small $a^{?\{n\}} \cdot a^{\{n\}}$ and strings $\underbrace{a\ldots a}_{n}$},
+ylabel={\small time in secs},
+enlargelimits=false,
+xtick={0,3000,...,12000},
+xmax=12500,
+ymax=35,
+ytick={0,5,...,30},
+scaled ticks=false,
+axis lines=left,
+width=6.5cm,
+height=5cm]
+\addplot[green,mark=square*,mark options={fill=white}] table {re2b.data};
+\addplot[black,mark=square*,mark options={fill=white}] table {re3.data};
+\end{axis}
+\end{tikzpicture}
+\end{tabular}
+\end{center}
+\begin{center}
+\begin{tabular}{@{}cc@{}}
+\begin{tikzpicture}
+\begin{axis}[
+xlabel={$(a^*)^* \cdot b$ and strings $\underbrace{a\ldots a}_{n}$},
+ylabel={time in secs},
+enlargelimits=false,
+xtick={0,5,...,30},
+xmax=33,
+ymax=35,
+ytick={0,5,...,30},
+scaled ticks=false,
+axis lines=left,
+width=5cm,
+height=5cm,
+legend entries={Java},
+legend pos=north west,
+legend cell align=left]
+\addplot[blue,mark=*, mark options={fill=white}] table {re-java.data};
+\end{axis}
+\end{tikzpicture}
+&
+\begin{tikzpicture}
+\begin{axis}[
+xlabel={$(a^*)^* \cdot b$ and strings $\underbrace{a\ldots a}_{n}$},
 ylabel={time in secs},
 enlargelimits=false,
 xtick={0,3000,...,12000},
 xmax=12500,
 ymax=35,
 \end{axis}
 \end{tikzpicture}
 \end{tabular}
 \end{center}\medskip
+\noindent
-\noindent Having specified in the previous lecture what
+We will use these regular expressions and strings
+as running examples.
+Having specified in the previous lecture what
 problem our regular expression matcher is supposed to solve,
 namely for any given regular expression $r$ and string $s$
 answer \textit{true} if and only if
 \[
 s \in L(r)
 \]
-\noindent we can look at an algorithm to solve this problem.
+\noindent we can look at an algorithm to solve this problem. Clearly
-Clearly we cannot use the function $L$ directly for this,
+we cannot use the function $L$ directly for this, because in general
-because in general the set of strings $L$ returns is infinite
+the set of strings $L$ returns is infinite (recall what $L(a^*)$ is).
-(recall what $L(a^*)$ is). In such cases there is no way we
+In such cases there is no way we can implement an exhaustive test for
-can implement an exhaustive test for whether a string is
+whether a string is member of this set or not. In contrast our
-member of this set or not. In contrast our matching algorithm
+matching algorithm will operate on the regular expression $r$ and
-will operate on the regular expression $r$ and string $s$,
+string $s$, only, which are both finite objects. Before we explain to
-only, which are both finite objects. Before we come to the matching
+the matching algorithm, however, let us have a closer look at what it
-algorithm, however, let us have a closer look at what it means
+means when two regular expressions are equivalent.
-when two regular expressions are equivalent.
 \subsection*{Regular Expression Equivalences}
 We already defined in Handout 1 what it means for two regular
 expressions to be equivalent, namely if their meaning is the
 $\ZERO \cdot r$ & $\equiv$ & $\ZERO$\\
 $r + r$ & $\equiv$ & $r$
 \end{tabular}
 \end{center}
-\noindent which always hold no matter what the regular
+\noindent which always hold no matter what the regular expression $r$
-expression $r$ looks like. The first two are easy to verify
+looks like. The first two are easy to verify since $L(\ZERO)$ is the
-since $L(\ZERO)$ is the empty set. The next two are also
+empty set. The next two are also easy to verify since $L(\ONE) =
-easy to verify since $L(\ONE) = \{[]\}$ and appending the
+\{[]\}$ and appending the empty string to every string of another set,
-empty string to every string of another set, leaves the set
+leaves the set unchanged. Be careful to fully comprehend the fifth and
-unchanged. Be careful to fully comprehend the fifth and sixth
+sixth equivalence: if you concatenate two sets of strings and one is
-equivalence: if you concatenate two sets of strings and one is
+the empty set, then the concatenation will also be the empty set. To
-the empty set, then the concatenation will also be the empty
+see this, check the definition of $\_ @ \_$ for sets. The last
-set. To see this, check the definition of $\_ @ \_$. The
+equivalence is again trivial.
-last equivalence is again trivial.
 What will be important later on is that we can orient these
 equivalences and read them from left to right. In this way we
 can view them as \emph{simplification rules}. Consider for
 example the regular expression
 \begin{equation}
 (r_1 + \ZERO) \cdot \ONE + ((\ONE + r_2) + r_3) \cdot (r_4 \cdot \ZERO)
 \label{big}
 \end{equation}
-\noindent If we can find an equivalent regular expression that
+\noindent If we can find an equivalent regular expression that is
-is simpler (smaller for example), then this might potentially
+simpler (smaller for example), then this might potentially make our
-make our matching algorithm run faster. The reason is that
+matching algorithm run faster. We can look for such a simpler regular
-whether a string $s$ is in $L(r)$ or in $L(r')$ with $r\equiv
+expression $r'$ because whether a string $s$ is in $L(r)$ or in
-r'$ will always give the same answer. In the example above you
+$L(r')$ with $r\equiv r'$ will always give the same answer. In the
-will see that the regular expression is equivalent to just $r_1$.
+example above you will see that the regular expression is equivalent
-You can verify this by iteratively applying the simplification
+to just $r_1$. You can verify this by iteratively applying the
-rules from above:
+simplification rules from above:
 \begin{center}
 \begin{tabular}{ll}
 & $(r_1 + \ZERO) \cdot \ONE + ((\ONE + r_2) + r_3) \cdot
 (\underline{r_4 \cdot \ZERO})$\smallskip\\
 algorithm quite a bit faster.
 \subsection*{The Matching Algorithm}
 The algorithm we will define below consists of two parts. One
-is the function $nullable$ which takes a regular expression as
+is the function $\textit{nullable}$ which takes a regular expression as
 argument and decides whether it can match the empty string
 (this means it returns a boolean in Scala). This can be easily
 defined recursively as follows:
 \begin{center}
 \begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}}
-$nullable(\ZERO)$      & $\dn$ & $\textit{false}$\\
+$\textit{nullable}(\ZERO)$      & $\dn$ & $\textit{false}$\\
-$nullable(\ONE)$         & $\dn$ & $true$\\
+$\textit{nullable}(\ONE)$         & $\dn$ & $\textit{true}$\\
-$nullable(c)$                & $\dn$ & $\textit{false}$\\
+$\textit{nullable}(c)$                & $\dn$ & $\textit{false}$\\
-$nullable(r_1 + r_2)$     & $\dn$ &  $nullable(r_1) \vee nullable(r_2)$\\
+$\textit{nullable}(r_1 + r_2)$     & $\dn$ &  $\textit{nullable}(r_1) \vee \textit{nullable}(r_2)$\\
-$nullable(r_1 \cdot r_2)$ & $\dn$ &  $nullable(r_1) \wedge nullable(r_2)$\\
+$\textit{nullable}(r_1 \cdot r_2)$ & $\dn$ &  $\textit{nullable}(r_1) \wedge \textit{nullable}(r_2)$\\
-$nullable(r^*)$              & $\dn$ & $true$ \\
+$\textit{nullable}(r^*)$              & $\dn$ & $\textit{true}$ \\
 \end{tabular}
 \end{center}
 \noindent The idea behind this function is that the following
 property holds:
 \[
-nullable(r) \;\;\text{if and only if}\;\; []\in L(r)
+\textit{nullable}(r) \;\;\text{if and only if}\;\; []\in L(r)
 \]
 \noindent Note on the left-hand side of the if-and-only-if we
 have a function we can implement; on the right we have its
 specification (which we cannot implement in a programming
 language).
 The other function of our matching algorithm calculates a
 \emph{derivative} of a regular expression. This is a function
 which will take a regular expression, say $r$, and a
-character, say $c$, as argument and returns a new regular
+character, say $c$, as arguments and returns a new regular
 expression. Be careful that the intuition behind this function
 is not so easy to grasp on first reading. Essentially this
 function solves the following problem: if $r$ can match a
 string of the form $c\!::\!s$, what does the regular
 expression look like that can match just $s$? The definition
 \begin{tabular}{l@ {\hspace{2mm}}c@ {\hspace{2mm}}l}
 $der\, c\, (\ZERO)$      & $\dn$ & $\ZERO$\\
 $der\, c\, (\ONE)$         & $\dn$ & $\ZERO$ \\
 $der\, c\, (d)$                & $\dn$ & if $c = d$ then $\ONE$ else $\ZERO$\\
 $der\, c\, (r_1 + r_2)$        & $\dn$ & $der\, c\, r_1 + der\, c\, r_2$\\
-$der\, c\, (r_1 \cdot r_2)$  & $\dn$  & if $nullable (r_1)$\\
+$der\, c\, (r_1 \cdot r_2)$  & $\dn$  & if $\textit{nullable} (r_1)$\\
 & & then $(der\,c\,r_1) \cdot r_2 + der\, c\, r_2$\\
 & & else $(der\, c\, r_1) \cdot r_2$\\
 $der\, c\, (r^*)$          & $\dn$ & $(der\,c\,r) \cdot (r^*)$
 \end{tabular}
 \end{center}
 involved. Fortunately, the $+$-case is still relatively
 straightforward: all strings of the form $c\!::\!s$ are either
 matched by the regular expression $r_1$ or $r_2$. So we just
 have to recursively call $der$ with these two regular
 expressions and compose the results again with $+$. Makes
-sense? The $\cdot$-case is more complicated: if $r_1\cdot r_2$
+sense?
+The $\cdot$-case is more complicated: if $r_1\cdot r_2$
 matches a string of the form $c\!::\!s$, then the first part
 must be matched by $r_1$. Consequently, it makes sense to
 construct the regular expression for $s$ by calling $der$ with
 $r_1$ and ``appending'' $r_2$. There is however one exception
 to this simple rule: if $r_1$ can match the empty string, then
 is again simple: if $r^*$ matches a string of the form
 $c\!::\!s$, then the first part must be ``matched'' by a
 single copy of $r$. Therefore we call recursively $der\,c\,r$
 and ``append'' $r^*$ in order to match the rest of $s$.
-If this did not make sense, here is another way to rationalise
+If this did not make sense yet, here is another way to rationalise
 the definition of $der$ by considering the following operation
 on sets:
 \begin{equation}\label{Der}
 Der\,c\,A\;\dn\;\{s\,|\,c\!::\!s \in A\}
 \begin{tabular}{rll}
 Input: $r_1$, $abc$\medskip\\
 Step 1: & build derivative of $a$ and $r_1$ & $(r_2 = der\,a\,r_1)$\smallskip\\
 Step 2: & build derivative of $b$ and $r_2$ & $(r_3 = der\,b\,r_2)$\smallskip\\
 Step 3: & build derivative of $c$ and $r_3$ & $(r_4 = der\,b\,r_3)$\smallskip\\
-Step 4: & the string is exhausted; test & ($nullable(r_4)$)\\
+Step 4: & the string is exhausted; test & ($\textit{nullable}(r_4)$)\\
 & whether $r_4$ can recognise the\\
 & empty string\smallskip\\
-Output: & result of this test $\Rightarrow true \,\text{or}\, \textit{false}$\\
+Output: & result of this test $\Rightarrow \textit{true} \,\text{or}\, \textit{false}$\\
 \end{tabular}
 \end{center}
 \noindent Again the operation $Der$ might help to rationalise
 this algorithm. We want to know whether $abc \in L(r_1)$. We
 do not know yet---but let us assume it is. Then $Der\,a\,L(r_1)$
 builds the set where all the strings not starting with $a$ are
 filtered out. Of the remaining strings, the $a$ is stripped
-off. Then we continue with filtering out all strings not
+off. So we should still have $bc$ in the set.
+Then we continue with filtering out all strings not
 starting with $b$ and stripping off the $b$ from the remaining
 strings, that means we build $Der\,b\,(Der\,a\,(L(r_1)))$.
 Finally we filter out all strings not starting with $c$ and
 strip off $c$ from the remaining string. This is
 $Der\,c\,(Der\,b\,(Der\,a\,(L(r))))$. Now if $abc$ was in the
 original set ($L(r_1)$), then in $Der\,c\,(Der\,b\,(Der\,a\,(L(r))))$
-must be the empty string. If not, then $abc$ was not in the
+must contain the empty string. If not, then $abc$ was not in the
 language we started with.
-Our matching algorithm using $der$ and $nullable$ works
+Our matching algorithm using $der$ and $\textit{nullable}$ works
 similarly, just using regular expression instead of sets. For
 this we need to extend the notion of derivatives from single
 characters to strings. This can be done using the following
 function, taking a string and regular expression as input and
 a regular expression as output.
 the time from the original string until it is exhausted.
 Having $ders$ in place, we can finally define our matching
 algorithm:
 \[
-matches\,s\,r \dn nullable(ders\,s\,r)
+matches\,s\,r \dn \textit{nullable}(ders\,s\,r)
 \]
 \noindent
 and we can claim that
 in the first lecture and handout, the functions and subfunctions
 for \pcode{matches} are shown in Figure~\ref{scala1}.
 \begin{figure}[p]
 \lstinputlisting{../progs/app5.scala}
-\caption{Scala implementation of the nullable and
+\caption{Scala implementation of the \textit{nullable} and
 derivative functions. These functions are easy to
-implement in functional languages, because pattern
+implement in functional languages, because their built-in pattern
 matching and recursion allow us to mimic the mathematical
 definitions very closely.\label{scala1}}
 \end{figure}
 For running the algorithm with our favourite example, the evil
 \end{center}
 \noindent With this fix we have a constant ``size'' regular
 expression for our running example no matter how large $n$ is.
 This means we have to also add cases for \pcode{NTIMES} in the
-functions $nullable$ and $der$. Does the change have any
+functions $\textit{nullable}$ and $der$. Does the change have any
 effect?
 \begin{center}
 \begin{tikzpicture}
 \begin{axis}[
 can within 30 seconds handle regular expressions up to
 $n = 950$ before a StackOverflow is raised. Python and Ruby
 (and our first version) could only handle $n = 27$ or so in 30
 second.
+SECOND EXAMPLE
 The moral is that our algorithm is rather sensitive to the
 size of regular expressions it needs to handle. This is of
-course obvious because both $nullable$ and $der$ frequently
+course obvious because both $\textit{nullable}$ and $der$ frequently
 need to traverse the whole regular expression. There seems,
 however, one more issue for making the algorithm run faster.
 The derivative function often produces ``useless''
 $\ZERO$s and $\ONE$s. To see this, consider $r = ((a
 \cdot b) + b)^*$ and the following two derivatives
 \addplot[black,mark=square*,mark options={fill=white}] table {re3.data};
 \end{axis}
 \end{tikzpicture}
 \end{center}
+SECOND EXAMPLE
 \section*{Proofs}
 You might not like doing proofs. But they serve a very
 important purpose in Computer Science: How can we be sure that
 our algorithm matches its specification. We can try to test
 \noindent
 A simple proof is for example showing the following
 property:
 \begin{equation}
-nullable(r) \;\;\text{if and only if}\;\; []\in L(r)
+\textit{nullable}(r) \;\;\text{if and only if}\;\; []\in L(r)
 \label{nullableprop}
 \end{equation}
 \noindent
 Let us say that this property is $P(r)$, then the first case
 we need to check is whether $P(\ZERO)$ (see recipe
 above). So we have to show that
 \[
-nullable(\ZERO) \;\;\text{if and only if}\;\;
+\textit{nullable}(\ZERO) \;\;\text{if and only if}\;\;
 []\in L(\ZERO)
 \]
-\noindent whereby $nullable(\ZERO)$ is by definition of
+\noindent whereby $\textit{nullable}(\ZERO)$ is by definition of
-the function $nullable$ always $\textit{false}$. We also have
+the function $\textit{nullable}$ always $\textit{false}$. We also have
 that $L(\ZERO)$ is by definition $\{\}$. It is
 impossible that the empty string $[]$ is in the empty set.
 Therefore also the right-hand side is false. Consequently we
 verified this case: both sides are false. We would still need
 to do this for $P(\ONE)$ and $P(c)$. I leave this to
 Next we need to check the inductive cases, for example
 $P(r_1 + r_2)$, which is
 \begin{equation}
-nullable(r_1 + r_2) \;\;\text{if and only if}\;\;
+\textit{nullable}(r_1 + r_2) \;\;\text{if and only if}\;\;
 []\in L(r_1 + r_2)
 \label{propalt}
 \end{equation}
 \noindent The difference to the base cases is that in this
 case we can already assume we proved
 \begin{center}
 \begin{tabular}{l}
-$nullable(r_1) \;\;\text{if and only if}\;\; []\in L(r_1)$ and\\
+$\textit{nullable}(r_1) \;\;\text{if and only if}\;\; []\in L(r_1)$ and\\
-$nullable(r_2) \;\;\text{if and only if}\;\; []\in L(r_2)$\\
+$\textit{nullable}(r_2) \;\;\text{if and only if}\;\; []\in L(r_2)$\\
 \end{tabular}
 \end{center}
 \noindent These are the induction hypotheses. To check this
-case, we can start from $nullable(r_1 + r_2)$, which by
+case, we can start from $\textit{nullable}(r_1 + r_2)$, which by
 definition is
 \[
-nullable(r_1) \vee nullable(r_2)
+\textit{nullable}(r_1) \vee \textit{nullable}(r_2)
 \]
 \noindent Using the two induction hypotheses from above,
 we can transform this into
 \[
 [] \in L(r_1) \vee []\in(r_2)
 \]
-\noindent We just replaced the $nullable(\ldots)$ parts by
+\noindent We just replaced the $\textit{nullable}(\ldots)$ parts by
 the equivalent $[] \in L(\ldots)$ from the induction
 hypotheses. A bit of thinking convinces you that if
 $[] \in L(r_1) \vee []\in L(r_2)$ then the empty string
 must be in the union $L(r_1)\cup L(r_2)$, that is
 \]
 \noindent but this is by definition of $L$ exactly $[] \in
 L(r_1 + r_2)$, which we needed to establish according to
 \eqref{propalt}. What we have shown is that starting from
-$nullable(r_1 + r_2)$ we have done equivalent transformations
+$\textit{nullable}(r_1 + r_2)$ we have done equivalent transformations
 to end up with $[] \in L(r_1 + r_2)$. Consequently we have
 established that $P(r_1 + r_2)$ holds.
 In order to complete the proof we would now need to look
 at the cases \mbox{$P(r_1\cdot r_2)$} and $P(r^*)$. Again I let you
 [] \in L(ders\,s\,r)
 \label{prefinalstep}
 \end{equation}
 \noindent We have also shown that testing whether the empty
-string is in a language is equivalent to the $nullable$
+string is in a language is equivalent to the $\textit{nullable}$
 function; see \eqref{nullableprop}. That means
 \eqref{prefinalstep} is equivalent with
 \[
-nullable(ders\,s\,r)
+\textit{nullable}(ders\,s\,r)
 \]
 \noindent But this is just the definition of $matches$
 \[

changeset 412	1cef3924f7a2
parent 399	5c1fbb39c93e
child 414	065ca01b62ae