handouts/ho02.tex
changeset 325 794c599cee53
parent 318 7975e4f0d4de
child 332 4755ad4b457b
--- a/handouts/ho02.tex	Fri Apr 17 04:56:30 2015 +0100
+++ b/handouts/ho02.tex	Fri May 01 19:19:31 2015 +0100
@@ -11,10 +11,13 @@
 
 This lecture is about implementing a more efficient regular
 expression matcher (the plots on the right)---more efficient
-than the matchers from regular expression libraries in Ruby and
-Python (the plots on the left). These plots show the running 
-time for the evil regular expression $a?^{\{n\}}a^{\{n\}}$.
-Note the different scales of the $x$-axes. 
+than the matchers from regular expression libraries in Ruby
+and Python (the plots on the left). These plots show the
+running time for the evil regular expression
+$a?^{\{n\}}a^{\{n\}}$ and string composed of $n$ \pcode{a}s.
+We will use this regular expression and strings as running
+example. To see the substantial differences in the two plots
+below, note the different scales of the $x$-axes. 
 
 
 \begin{center}
@@ -60,10 +63,9 @@
 
 
 \noindent Having specified in the previous lecture what
-problem our regular expression matcher, which we will call
-\pcode{matches}, is supposed to solve, namely for any given
-regular expression $r$ and string $s$ answer \textit{true} if
-and only if
+problem our regular expression matcher is supposed to solve,
+namely for any given regular expression $r$ and string $s$
+answer \textit{true} if and only if
 
 \[
 s \in L(r)
@@ -75,8 +77,8 @@
 (recall what $L(a^*)$ is). In such cases there is no way we
 can implement an exhaustive test for whether a string is
 member of this set or not. In contrast our matching algorithm
-will mainly operate on the regular expression $r$ and string
-$s$, which are both finite. Before we come to the matching
+will operate on the regular expression $r$ and string $s$,
+only, which are both finite. Before we come to the matching
 algorithm, however, let us have a closer look at what it means
 when two regular expressions are equivalent.
 
@@ -149,19 +151,19 @@
 \end{center}
 
 \noindent which always hold no matter what the regular
-expression $r$ looks like. The first two are easy to verify since
-$L(\varnothing)$ is the empty set. The next two are also easy 
-to verify since $L(\epsilon) = \{[]\}$ and appending the empty 
-string to every string of another set, leaves the set 
-unchanged. Be careful to fully comprehend the fifth and 
-sixth equivalence: if you concatenate two sets of strings
-and one is the empty set, then the concatenation will also be
-the empty set. Check the definition of \pcode{_ @ _}.
-The last equivalence is again trivial.
+expression $r$ looks like. The first two are easy to verify
+since $L(\varnothing)$ is the empty set. The next two are also
+easy to verify since $L(\epsilon) = \{[]\}$ and appending the
+empty string to every string of another set, leaves the set
+unchanged. Be careful to fully comprehend the fifth and sixth
+equivalence: if you concatenate two sets of strings and one is
+the empty set, then the concatenation will also be the empty
+set. To see this, check the definition of \pcode{_ @ _}. The
+last equivalence is again trivial.
 
 What will be important later on is that we can orient these
 equivalences and read them from left to right. In this way we
-can view them as \emph{simplification rules}. Suppose for 
+can view them as \emph{simplification rules}. Consider for 
 example the regular expression 
 
 \begin{equation}
@@ -170,12 +172,13 @@
 \end{equation}
 
 \noindent If we can find an equivalent regular expression that
-is simpler (smaller for example), then this might potentially 
-make our matching algorithm run faster. The reason is that 
-whether a string $s$ is in $L(r)$ or in $L(r')$ with $r\equiv r'$
-will always give the same answer. In the example above you 
-will see that the regular expression is equivalent to $r_1$
-if you iteratively apply the simplification rules from above:
+is simpler (smaller for example), then this might potentially
+make our matching algorithm run faster. The reason is that
+whether a string $s$ is in $L(r)$ or in $L(r')$ with $r\equiv
+r'$ will always give the same answer. In the example above you
+will see that the regular expression is equivalent to $r_1$.
+You can verify this by iteratively applying the simplification
+rules from above:
 
 \begin{center}
 \begin{tabular}{ll}
@@ -222,9 +225,10 @@
 nullable(r) \;\;\text{if and only if}\;\; []\in L(r)
 \]
 
-\noindent Note on the left-hand side we have a function we can
-implement; on the right we have its specification (which we
-cannot implement in a programming language).
+\noindent Note on the left-hand side of the if-and-only-if we
+have a function we can implement; on the right we have its
+specification (which we cannot implement in a programming
+language).
 
 The other function of our matching algorithm calculates a
 \emph{derivative} of a regular expression. This is a function
@@ -234,7 +238,7 @@
 is not so easy to grasp on first reading. Essentially this
 function solves the following problem: if $r$ can match a
 string of the form $c\!::\!s$, what does the regular
-expression look like that can match just $s$. The definition
+expression look like that can match just $s$? The definition
 of this function is as follows:
 
 \begin{center}
@@ -252,16 +256,18 @@
 
 \noindent The first two clauses can be rationalised as
 follows: recall that $der$ should calculate a regular
-expression, if the ``input'' regular expression can match a
-string of the form $c\!::\!s$. Since neither $\varnothing$ nor
-$\epsilon$ can match such a string we return $\varnothing$. In
-the third case we have to make a case-distinction: In case the
-regular expression is $c$, then clearly it can recognise a
-string of the form $c\!::\!s$, just that $s$ is the empty
-string. Therefore we return the $\epsilon$-regular expression.
-In the other case we again return $\varnothing$ since no
-string of the $c\!::\!s$ can be matched. Next come the
-recursive cases. Fortunately, the $+$-case is still relatively
+expression so that given the ``input'' regular expression can
+match a string of the form $c\!::\!s$, we want a regular
+expression for $s$. Since neither $\varnothing$ nor $\epsilon$
+can match a string of the form $c\!::\!s$, we return
+$\varnothing$. In the third case we have to make a
+case-distinction: In case the regular expression is $c$, then
+clearly it can recognise a string of the form $c\!::\!s$, just
+that $s$ is the empty string. Therefore we return the
+$\epsilon$-regular expression. In the other case we again
+return $\varnothing$ since no string of the $c\!::\!s$ can be
+matched. Next come the recursive cases, which are a bit more
+involved. Fortunately, the $+$-case is still relatively
 straightforward: all strings of the form $c\!::\!s$ are either
 matched by the regular expression $r_1$ or $r_2$. So we just
 have to recursively call $der$ with these two regular
@@ -275,13 +281,12 @@
 all of $c\!::\!s$ is matched by $r_2$. So in case $r_1$ is
 nullable (that is can match the empty string) we have to allow
 the choice $der\,c\,r_2$ for calculating the regular
-expression that can match $s$. Therefore we have to 
-add the regular expression $der\,c\,r_2$.
-The $*$-case is again simple:
-if $r^*$ matches a string of the form $c\!::\!s$, then the
-first part must be ``matched'' by a single copy of $r$.
-Therefore we call recursively $der\,c\,r$ and ``append'' $r^*$
-in order to match the rest of $s$.
+expression that can match $s$. Therefore we have to add the
+regular expression $der\,c\,r_2$ in the result. The $*$-case
+is again simple: if $r^*$ matches a string of the form
+$c\!::\!s$, then the first part must be ``matched'' by a
+single copy of $r$. Therefore we call recursively $der\,c\,r$
+and ``append'' $r^*$ in order to match the rest of $s$.
 
 If this did not make sense, here is another way to rationalise
 the definition of $der$ by considering the following operation
@@ -333,7 +338,7 @@
 
 \noindent Again the operation $Der$ might help to rationalise
 this algorithm. We want to know whether $abc \in L(r_1)$. We
-do not know yet. But let us assume it is. Then $Der\,a\,L(r_1)$
+do not know yet---but let us assume it is. Then $Der\,a\,L(r_1)$
 builds the set where all the strings not starting with $a$ are
 filtered out. Of the remaining strings, the $a$ is stripped
 off. Then we continue with filtering out all strings not
@@ -343,12 +348,12 @@
 strip off $c$ from the remaining string. This is
 $Der\,c\,(Der\,b\,(Der\,a\,(L(r))))$. Now if $abc$ was in the 
 original set ($L(r_1)$), then in $Der\,c\,(Der\,b\,(Der\,a\,(L(r))))$ 
-must be the empty string. If not then $abc$ was not in the 
+must be the empty string. If not, then $abc$ was not in the 
 language we started with.
 
 Our matching algorithm using $der$ and $nullable$ works
 similarly, just using regular expression instead of sets. For
-this we need to extend the notion of derivatives from
+this we need to extend the notion of derivatives from single
 characters to strings. This can be done using the following
 function, taking a string and regular expression as input and
 a regular expression as output.
@@ -360,17 +365,17 @@
   \end{tabular}
 \end{center}
 
-\noindent This function essentially iterates $der$ taking one
-character at the time from the original string until it is
-exhausted. Having $ders$ in place, we can finally define our
-matching algorithm:
+\noindent This function iterates $der$ taking one character at
+the time from the original string until it is exhausted.
+Having $ders$ in place, we can finally define our matching
+algorithm:
 
 \[
-matches\,s\,r = nullable(ders\,s\,r)
+matches\,s\,r \dn nullable(ders\,s\,r)
 \]
 
 \noindent
-We can claim that
+and we can claim that
 
 \[
 matches\,s\,r\quad\text{if and only if}\quad s\in L(r)
@@ -398,7 +403,10 @@
 \begin{figure}[p]
 \lstinputlisting{../progs/app5.scala}
 \caption{Scala implementation of the nullable and 
-  derivatives functions.\label{scala1}}
+  derivatives functions. These functions are easy to
+  implement in functional languages, because pattern 
+  matching and recursion allow us to mimic the mathematical
+  definitions very closely.\label{scala1}}
 \end{figure}
 
 For running the algorithm with our favourite example, the evil
@@ -432,8 +440,8 @@
   table {re1.data};  
\end{axis}
\end{tikzpicture}
 \end{center}
 
-\noindent Analysing this failure a bit we notice that 
-for $a^{\{n\}}$ we generate quite big regular expressions:
+\noindent Analysing this failure we notice that for
+$a^{\{n\}}$ we generate quite big regular expressions:
 
 \begin{center}
 \begin{tabular}{rl}
@@ -451,17 +459,18 @@
 large regular expressions will cause problems. This problem
 is aggravated by $a?$ being represented as $a + \epsilon$.
 
-We can fix this by having an explicit constructor for 
+We can however fix this by having an explicit constructor for
 $r^{\{n\}}$. In Scala we would introduce a constructor like
 
 \begin{center}
 \code{case class NTIMES(r: Rexp, n: Int) extends Rexp}
 \end{center}
 
-\noindent With this we have a constant ``size'' regular
-expression for our running example no matter how large $n$ 
-is. This means we have to also add cases for $nullable$ and
-$der$. Does the change have any effect?
+\noindent With this fix we have a constant ``size'' regular
+expression for our running example no matter how large $n$ is.
+This means we have to also add cases for \pcode{NTIMES} in the
+functions $nullable$ and $der$. Does the change have any
+effect?
 
 \begin{center}
 \begin{tikzpicture}
\begin{axis}[
    xlabel={\pcode{a}s},
    ylabel={time in secs},
@@ -490,12 +499,12 @@
 
 The moral is that our algorithm is rather sensitive to the
 size of regular expressions it needs to handle. This is of
-course obvious because both $nullable$ and $der$ need to
-traverse the whole regular expression. There seems to be one
-more issue for making the algorithm run faster. The derivative
-function often produces ``useless'' $\varnothing$s and
-$\epsilon$s. To see this, consider $r = ((a \cdot b) + b)^*$
-and the following two derivatives
+course obvious because both $nullable$ and $der$ frequently
+need to traverse the whole regular expression. There seems,
+however, one more issue for making the algorithm run faster.
+The derivative function often produces ``useless''
+$\varnothing$s and $\epsilon$s. To see this, consider $r = ((a
+\cdot b) + b)^*$ and the following two derivatives
 
 \begin{center}
 \begin{tabular}{l}
@@ -527,7 +536,7 @@
 and $n$-times (the latter because we added it in the second
 version of our matcher). There is no rule for a star, because
 empirical data and also a little thought showed that
-simplifying under a star is waste of computation time. The
+simplifying under a star is a waste of computation time. The
 simplification function will be called after every derivation.
 This additional step removes all the ``junk'' the derivative
 function introduced. Does this improve the speed? You bet!!
@@ -535,7 +544,9 @@
 \begin{figure}[p]
 \lstinputlisting{../progs/app6.scala}
 \caption{The simplification function and modified 
-\texttt{ders}-function.\label{scala2}}
+\texttt{ders}-function; this function now
+calls \texttt{der} first, but then tries to simplify
+the resulting derivative regular expressions.\label{scala2}}
 \end{figure}
 
 \begin{center}