afl-material: comparison handouts/ho02.tex

equal deleted inserted replaced

-:6cb517754d8a
+:794c599cee53
 \section*{Handout 2 (Regular Expression Matching)}
 This lecture is about implementing a more efficient regular
 expression matcher (the plots on the right)---more efficient
-than the matchers from regular expression libraries in Ruby and
+than the matchers from regular expression libraries in Ruby
-Python (the plots on the left). These plots show the running
+and Python (the plots on the left). These plots show the
-time for the evil regular expression $a?^{\{n\}}a^{\{n\}}$.
+running time for the evil regular expression
-Note the different scales of the $x$-axes.
+$a?^{\{n\}}a^{\{n\}}$ and string composed of $n$ \pcode{a}s.
+We will use this regular expression and strings as running
+example. To see the substantial differences in the two plots
+below, note the different scales of the $x$-axes.
 \begin{center}
 \begin{tabular}{@{}cc@{}}
 \begin{tikzpicture}
 \end{tabular}
 \end{center}\medskip
 \noindent Having specified in the previous lecture what
-problem our regular expression matcher, which we will call
+problem our regular expression matcher is supposed to solve,
-\pcode{matches}, is supposed to solve, namely for any given
+namely for any given regular expression $r$ and string $s$
-regular expression $r$ and string $s$ answer \textit{true} if
+answer \textit{true} if and only if
-and only if
 \[
 s \in L(r)
 \]
 Clearly we cannot use the function $L$ directly for this,
 because in general the set of strings $L$ returns is infinite
 (recall what $L(a^*)$ is). In such cases there is no way we
 can implement an exhaustive test for whether a string is
 member of this set or not. In contrast our matching algorithm
-will mainly operate on the regular expression $r$ and string
+will operate on the regular expression $r$ and string $s$,
-$s$, which are both finite. Before we come to the matching
+only, which are both finite. Before we come to the matching
 algorithm, however, let us have a closer look at what it means
 when two regular expressions are equivalent.
 \subsection*{Regular Expression Equivalences}
 $r + r$ & $\equiv$ & $r$
 \end{tabular}
 \end{center}
 \noindent which always hold no matter what the regular
-expression $r$ looks like. The first two are easy to verify since
+expression $r$ looks like. The first two are easy to verify
-$L(\varnothing)$ is the empty set. The next two are also easy
+since $L(\varnothing)$ is the empty set. The next two are also
-to verify since $L(\epsilon) = \{[]\}$ and appending the empty
+easy to verify since $L(\epsilon) = \{[]\}$ and appending the
-string to every string of another set, leaves the set
+empty string to every string of another set, leaves the set
-unchanged. Be careful to fully comprehend the fifth and
+unchanged. Be careful to fully comprehend the fifth and sixth
-sixth equivalence: if you concatenate two sets of strings
+equivalence: if you concatenate two sets of strings and one is
-and one is the empty set, then the concatenation will also be
+the empty set, then the concatenation will also be the empty
-the empty set. Check the definition of \pcode{_ @ _}.
+set. To see this, check the definition of \pcode{_ @ _}. The
-The last equivalence is again trivial.
+last equivalence is again trivial.
 What will be important later on is that we can orient these
 equivalences and read them from left to right. In this way we
-can view them as \emph{simplification rules}. Suppose for
+can view them as \emph{simplification rules}. Consider for
 example the regular expression
 \begin{equation}
 (r_1 + \varnothing) \cdot \epsilon + ((\epsilon + r_2) + r_3) \cdot (r_4 \cdot \varnothing)
 \label{big}
 \end{equation}
 \noindent If we can find an equivalent regular expression that
 is simpler (smaller for example), then this might potentially
 make our matching algorithm run faster. The reason is that
-whether a string $s$ is in $L(r)$ or in $L(r')$ with $r\equiv r'$
+whether a string $s$ is in $L(r)$ or in $L(r')$ with $r\equiv
-will always give the same answer. In the example above you
+r'$ will always give the same answer. In the example above you
-will see that the regular expression is equivalent to $r_1$
+will see that the regular expression is equivalent to $r_1$.
-if you iteratively apply the simplification rules from above:
+You can verify this by iteratively applying the simplification
+rules from above:
 \begin{center}
 \begin{tabular}{ll}
 & $(r_1 + \varnothing) \cdot \epsilon + ((\epsilon + r_2) + r_3) \cdot
 (\underline{r_4 \cdot \varnothing})$\smallskip\\
 \[
 nullable(r) \;\;\text{if and only if}\;\; []\in L(r)
 \]
-\noindent Note on the left-hand side we have a function we can
+\noindent Note on the left-hand side of the if-and-only-if we
-implement; on the right we have its specification (which we
+have a function we can implement; on the right we have its
-cannot implement in a programming language).
+specification (which we cannot implement in a programming
+language).
 The other function of our matching algorithm calculates a
 \emph{derivative} of a regular expression. This is a function
 which will take a regular expression, say $r$, and a
 character, say $c$, as argument and return a new regular
 expression. Be careful that the intuition behind this function
 is not so easy to grasp on first reading. Essentially this
 function solves the following problem: if $r$ can match a
 string of the form $c\!::\!s$, what does the regular
-expression look like that can match just $s$. The definition
+expression look like that can match just $s$? The definition
 of this function is as follows:
 \begin{center}
 \begin{tabular}{l@ {\hspace{2mm}}c@ {\hspace{2mm}}l}
 $der\, c\, (\varnothing)$      & $\dn$ & $\varnothing$\\
 \end{tabular}
 \end{center}
 \noindent The first two clauses can be rationalised as
 follows: recall that $der$ should calculate a regular
-expression, if the ``input'' regular expression can match a
+expression so that given the ``input'' regular expression can
-string of the form $c\!::\!s$. Since neither $\varnothing$ nor
+match a string of the form $c\!::\!s$, we want a regular
-$\epsilon$ can match such a string we return $\varnothing$. In
+expression for $s$. Since neither $\varnothing$ nor $\epsilon$
-the third case we have to make a case-distinction: In case the
+can match a string of the form $c\!::\!s$, we return
-regular expression is $c$, then clearly it can recognise a
+$\varnothing$. In the third case we have to make a
-string of the form $c\!::\!s$, just that $s$ is the empty
+case-distinction: In case the regular expression is $c$, then
-string. Therefore we return the $\epsilon$-regular expression.
+clearly it can recognise a string of the form $c\!::\!s$, just
-In the other case we again return $\varnothing$ since no
+that $s$ is the empty string. Therefore we return the
-string of the $c\!::\!s$ can be matched. Next come the
+$\epsilon$-regular expression. In the other case we again
-recursive cases. Fortunately, the $+$-case is still relatively
+return $\varnothing$ since no string of the $c\!::\!s$ can be
+matched. Next come the recursive cases, which are a bit more
+involved. Fortunately, the $+$-case is still relatively
 straightforward: all strings of the form $c\!::\!s$ are either
 matched by the regular expression $r_1$ or $r_2$. So we just
 have to recursively call $der$ with these two regular
 expressions and compose the results again with $+$. Yes, makes
 sense? The $\cdot$-case is more complicated: if $r_1\cdot r_2$
 $r_1$ and ``appending'' $r_2$. There is however one exception
 to this simple rule: if $r_1$ can match the empty string, then
 all of $c\!::\!s$ is matched by $r_2$. So in case $r_1$ is
 nullable (that is can match the empty string) we have to allow
 the choice $der\,c\,r_2$ for calculating the regular
-expression that can match $s$. Therefore we have to
+expression that can match $s$. Therefore we have to add the
-add the regular expression $der\,c\,r_2$.
+regular expression $der\,c\,r_2$ in the result. The $*$-case
-The $*$-case is again simple:
+is again simple: if $r^*$ matches a string of the form
-if $r^*$ matches a string of the form $c\!::\!s$, then the
+$c\!::\!s$, then the first part must be ``matched'' by a
-first part must be ``matched'' by a single copy of $r$.
+single copy of $r$. Therefore we call recursively $der\,c\,r$
-Therefore we call recursively $der\,c\,r$ and ``append'' $r^*$
+and ``append'' $r^*$ in order to match the rest of $s$.
-in order to match the rest of $s$.
 If this did not make sense, here is another way to rationalise
 the definition of $der$ by considering the following operation
 on sets:
 \end{tabular}
 \end{center}
 \noindent Again the operation $Der$ might help to rationalise
 this algorithm. We want to know whether $abc \in L(r_1)$. We
-do not know yet. But let us assume it is. Then $Der\,a\,L(r_1)$
+do not know yet---but let us assume it is. Then $Der\,a\,L(r_1)$
 builds the set where all the strings not starting with $a$ are
 filtered out. Of the remaining strings, the $a$ is stripped
 off. Then we continue with filtering out all strings not
 starting with $b$ and stripping off the $b$ from the remaining
 strings, that means we build $Der\,b\,(Der\,a\,(L(r_1)))$.
 Finally we filter out all strings not starting with $c$ and
 strip off $c$ from the remaining string. This is
 $Der\,c\,(Der\,b\,(Der\,a\,(L(r))))$. Now if $abc$ was in the
 original set ($L(r_1)$), then in $Der\,c\,(Der\,b\,(Der\,a\,(L(r))))$
-must be the empty string. If not then $abc$ was not in the
+must be the empty string. If not, then $abc$ was not in the
 language we started with.
 Our matching algorithm using $der$ and $nullable$ works
 similarly, just using regular expression instead of sets. For
-this we need to extend the notion of derivatives from
+this we need to extend the notion of derivatives from single
 characters to strings. This can be done using the following
 function, taking a string and regular expression as input and
 a regular expression as output.
 \begin{center}
 $\textit{ders}\, []\, r$     & $\dn$ & $r$ & \\
 $\textit{ders}\, (c\!::\!s)\, r$ & $\dn$ & $\textit{ders}\,s\,(der\,c\,r)$ & \\
 \end{tabular}
 \end{center}
-\noindent This function essentially iterates $der$ taking one
+\noindent This function iterates $der$ taking one character at
-character at the time from the original string until it is
+the time from the original string until it is exhausted.
-exhausted. Having $ders$ in place, we can finally define our
+Having $ders$ in place, we can finally define our matching
-matching algorithm:
+algorithm:
 \[
-matches\,s\,r = nullable(ders\,s\,r)
+matches\,s\,r \dn nullable(ders\,s\,r)
 \]
 \noindent
-We can claim that
+and we can claim that
 \[
 matches\,s\,r\quad\text{if and only if}\quad s\in L(r)
 \]
 for \pcode{matches} are shown in Figure~\ref{scala1}.
 \begin{figure}[p]
 \lstinputlisting{../progs/app5.scala}
 \caption{Scala implementation of the nullable and
-derivatives functions.\label{scala1}}
+derivatives functions. These functions are easy to
+implement in functional languages, because pattern
+matching and recursion allow us to mimic the mathematical
+definitions very closely.\label{scala1}}
 \end{figure}
 For running the algorithm with our favourite example, the evil
 regular expression $a?^{\{n\}}a^{\{n\}}$, we need to implement
 the optional regular expression and the exactly $n$-times
 table {re1.data};
 \end{axis}
 \end{tikzpicture}
 \end{center}
-\noindent Analysing this failure a bit we notice that
+\noindent Analysing this failure we notice that for
-for $a^{\{n\}}$ we generate quite big regular expressions:
+$a^{\{n\}}$ we generate quite big regular expressions:
 \begin{center}
 \begin{tabular}{rl}
 1: & $a$\\
 2: & $a\cdot a$\\
 \noindent Our algorithm traverses such regular expressions at
 least once every time a derivative is calculated. So having
 large regular expressions will cause problems. This problem
 is aggravated by $a?$ being represented as $a + \epsilon$.
-We can fix this by having an explicit constructor for
+We can however fix this by having an explicit constructor for
 $r^{\{n\}}$. In Scala we would introduce a constructor like
 \begin{center}
 \code{case class NTIMES(r: Rexp, n: Int) extends Rexp}
 \end{center}
-\noindent With this we have a constant ``size'' regular
+\noindent With this fix we have a constant ``size'' regular
-expression for our running example no matter how large $n$
+expression for our running example no matter how large $n$ is.
-is. This means we have to also add cases for $nullable$ and
+This means we have to also add cases for \pcode{NTIMES} in the
-$der$. Does the change have any effect?
+functions $nullable$ and $der$. Does the change have any
+effect?
 \begin{center}
 \begin{tikzpicture}
 \begin{axis}[
 xlabel={\pcode{a}s},
 can within 30 seconds handle regular expressions up to
 $n = 950$ before a StackOverflow is raised.
 The moral is that our algorithm is rather sensitive to the
 size of regular expressions it needs to handle. This is of
-course obvious because both $nullable$ and $der$ need to
+course obvious because both $nullable$ and $der$ frequently
-traverse the whole regular expression. There seems to be one
+need to traverse the whole regular expression. There seems,
-more issue for making the algorithm run faster. The derivative
+however, one more issue for making the algorithm run faster.
-function often produces ``useless'' $\varnothing$s and
+The derivative function often produces ``useless''
-$\epsilon$s. To see this, consider $r = ((a \cdot b) + b)^*$
+$\varnothing$s and $\epsilon$s. To see this, consider $r = ((a
-and the following two derivatives
+\cdot b) + b)^*$ and the following two derivatives
 \begin{center}
 \begin{tabular}{l}
 $der\,a\,r = ((\epsilon \cdot b) + \varnothing) \cdot r$\\
 $der\,b\,r = ((\varnothing \cdot b) + \epsilon)\cdot r$\\
 regular expression and simplifies it according to the rules
 given at the beginning. There are only rules for $+$, $\cdot$
 and $n$-times (the latter because we added it in the second
 version of our matcher). There is no rule for a star, because
 empirical data and also a little thought showed that
-simplifying under a star is waste of computation time. The
+simplifying under a star is a waste of computation time. The
 simplification function will be called after every derivation.
 This additional step removes all the ``junk'' the derivative
 function introduced. Does this improve the speed? You bet!!
 \begin{figure}[p]
 \lstinputlisting{../progs/app6.scala}
 \caption{The simplification function and modified
-\texttt{ders}-function.\label{scala2}}
+\texttt{ders}-function; this function now
+calls \texttt{der} first, but then tries to simplify
+the resulting derivative regular expressions.\label{scala2}}
 \end{figure}
 \begin{center}
 \begin{tikzpicture}
 \begin{axis}[xlabel={\pcode{a}s},ylabel={time in secs},

changeset 325	794c599cee53
parent 318	7975e4f0d4de
child 332	4755ad4b457b