--- a/handouts/ho02.tex Fri Apr 17 04:56:30 2015 +0100
+++ b/handouts/ho02.tex Fri May 01 19:19:31 2015 +0100
@@ -11,10 +11,13 @@
This lecture is about implementing a more efficient regular
expression matcher (the plots on the right)---more efficient
-than the matchers from regular expression libraries in Ruby and
-Python (the plots on the left). These plots show the running
-time for the evil regular expression $a?^{\{n\}}a^{\{n\}}$.
-Note the different scales of the $x$-axes.
+than the matchers from regular expression libraries in Ruby
+and Python (the plots on the left). These plots show the
+running time for the evil regular expression
+$a?^{\{n\}}a^{\{n\}}$ and string composed of $n$ \pcode{a}s.
+We will use this regular expression and strings as running
+example. To see the substantial differences in the two plots
+below, note the different scales of the $x$-axes.
\begin{center}
@@ -60,10 +63,9 @@
\noindent Having specified in the previous lecture what
-problem our regular expression matcher, which we will call
-\pcode{matches}, is supposed to solve, namely for any given
-regular expression $r$ and string $s$ answer \textit{true} if
-and only if
+problem our regular expression matcher is supposed to solve,
+namely for any given regular expression $r$ and string $s$
+answer \textit{true} if and only if
\[
s \in L(r)
@@ -75,8 +77,8 @@
(recall what $L(a^*)$ is). In such cases there is no way we
can implement an exhaustive test for whether a string is
member of this set or not. In contrast our matching algorithm
-will mainly operate on the regular expression $r$ and string
-$s$, which are both finite. Before we come to the matching
+will operate on the regular expression $r$ and string $s$,
+only, which are both finite. Before we come to the matching
algorithm, however, let us have a closer look at what it means
when two regular expressions are equivalent.
@@ -149,19 +151,19 @@
\end{center}
\noindent which always hold no matter what the regular
-expression $r$ looks like. The first two are easy to verify since
-$L(\varnothing)$ is the empty set. The next two are also easy
-to verify since $L(\epsilon) = \{[]\}$ and appending the empty
-string to every string of another set, leaves the set
-unchanged. Be careful to fully comprehend the fifth and
-sixth equivalence: if you concatenate two sets of strings
-and one is the empty set, then the concatenation will also be
-the empty set. Check the definition of \pcode{_ @ _}.
-The last equivalence is again trivial.
+expression $r$ looks like. The first two are easy to verify
+since $L(\varnothing)$ is the empty set. The next two are also
+easy to verify since $L(\epsilon) = \{[]\}$ and appending the
+empty string to every string of another set, leaves the set
+unchanged. Be careful to fully comprehend the fifth and sixth
+equivalence: if you concatenate two sets of strings and one is
+the empty set, then the concatenation will also be the empty
+set. To see this, check the definition of \pcode{_ @ _}. The
+last equivalence is again trivial.
What will be important later on is that we can orient these
equivalences and read them from left to right. In this way we
-can view them as \emph{simplification rules}. Suppose for
+can view them as \emph{simplification rules}. Consider for
example the regular expression
\begin{equation}
@@ -170,12 +172,13 @@
\end{equation}
\noindent If we can find an equivalent regular expression that
-is simpler (smaller for example), then this might potentially
-make our matching algorithm run faster. The reason is that
-whether a string $s$ is in $L(r)$ or in $L(r')$ with $r\equiv r'$
-will always give the same answer. In the example above you
-will see that the regular expression is equivalent to $r_1$
-if you iteratively apply the simplification rules from above:
+is simpler (smaller for example), then this might potentially
+make our matching algorithm run faster. The reason is that
+whether a string $s$ is in $L(r)$ or in $L(r')$ with $r\equiv
+r'$ will always give the same answer. In the example above you
+will see that the regular expression is equivalent to $r_1$.
+You can verify this by iteratively applying the simplification
+rules from above:
\begin{center}
\begin{tabular}{ll}
@@ -222,9 +225,10 @@
nullable(r) \;\;\text{if and only if}\;\; []\in L(r)
\]
-\noindent Note on the left-hand side we have a function we can
-implement; on the right we have its specification (which we
-cannot implement in a programming language).
+\noindent Note on the left-hand side of the if-and-only-if we
+have a function we can implement; on the right we have its
+specification (which we cannot implement in a programming
+language).
The other function of our matching algorithm calculates a
\emph{derivative} of a regular expression. This is a function
@@ -234,7 +238,7 @@
is not so easy to grasp on first reading. Essentially this
function solves the following problem: if $r$ can match a
string of the form $c\!::\!s$, what does the regular
-expression look like that can match just $s$. The definition
+expression look like that can match just $s$? The definition
of this function is as follows:
\begin{center}
@@ -252,16 +256,18 @@
\noindent The first two clauses can be rationalised as
follows: recall that $der$ should calculate a regular
-expression, if the ``input'' regular expression can match a
-string of the form $c\!::\!s$. Since neither $\varnothing$ nor
-$\epsilon$ can match such a string we return $\varnothing$. In
-the third case we have to make a case-distinction: In case the
-regular expression is $c$, then clearly it can recognise a
-string of the form $c\!::\!s$, just that $s$ is the empty
-string. Therefore we return the $\epsilon$-regular expression.
-In the other case we again return $\varnothing$ since no
-string of the $c\!::\!s$ can be matched. Next come the
-recursive cases. Fortunately, the $+$-case is still relatively
+expression so that given the ``input'' regular expression can
+match a string of the form $c\!::\!s$, we want a regular
+expression for $s$. Since neither $\varnothing$ nor $\epsilon$
+can match a string of the form $c\!::\!s$, we return
+$\varnothing$. In the third case we have to make a
+case-distinction: In case the regular expression is $c$, then
+clearly it can recognise a string of the form $c\!::\!s$, just
+that $s$ is the empty string. Therefore we return the
+$\epsilon$-regular expression. In the other case we again
+return $\varnothing$ since no string of the $c\!::\!s$ can be
+matched. Next come the recursive cases, which are a bit more
+involved. Fortunately, the $+$-case is still relatively
straightforward: all strings of the form $c\!::\!s$ are either
matched by the regular expression $r_1$ or $r_2$. So we just
have to recursively call $der$ with these two regular
@@ -275,13 +281,12 @@
all of $c\!::\!s$ is matched by $r_2$. So in case $r_1$ is
nullable (that is can match the empty string) we have to allow
the choice $der\,c\,r_2$ for calculating the regular
-expression that can match $s$. Therefore we have to
-add the regular expression $der\,c\,r_2$.
-The $*$-case is again simple:
-if $r^*$ matches a string of the form $c\!::\!s$, then the
-first part must be ``matched'' by a single copy of $r$.
-Therefore we call recursively $der\,c\,r$ and ``append'' $r^*$
-in order to match the rest of $s$.
+expression that can match $s$. Therefore we have to add the
+regular expression $der\,c\,r_2$ in the result. The $*$-case
+is again simple: if $r^*$ matches a string of the form
+$c\!::\!s$, then the first part must be ``matched'' by a
+single copy of $r$. Therefore we call recursively $der\,c\,r$
+and ``append'' $r^*$ in order to match the rest of $s$.
If this did not make sense, here is another way to rationalise
the definition of $der$ by considering the following operation
@@ -333,7 +338,7 @@
\noindent Again the operation $Der$ might help to rationalise
this algorithm. We want to know whether $abc \in L(r_1)$. We
-do not know yet. But let us assume it is. Then $Der\,a\,L(r_1)$
+do not know yet---but let us assume it is. Then $Der\,a\,L(r_1)$
builds the set where all the strings not starting with $a$ are
filtered out. Of the remaining strings, the $a$ is stripped
off. Then we continue with filtering out all strings not
@@ -343,12 +348,12 @@
strip off $c$ from the remaining string. This is
$Der\,c\,(Der\,b\,(Der\,a\,(L(r))))$. Now if $abc$ was in the
original set ($L(r_1)$), then in $Der\,c\,(Der\,b\,(Der\,a\,(L(r))))$
-must be the empty string. If not then $abc$ was not in the
+must be the empty string. If not, then $abc$ was not in the
language we started with.
Our matching algorithm using $der$ and $nullable$ works
similarly, just using regular expression instead of sets. For
-this we need to extend the notion of derivatives from
+this we need to extend the notion of derivatives from single
characters to strings. This can be done using the following
function, taking a string and regular expression as input and
a regular expression as output.
@@ -360,17 +365,17 @@
\end{tabular}
\end{center}
-\noindent This function essentially iterates $der$ taking one
-character at the time from the original string until it is
-exhausted. Having $ders$ in place, we can finally define our
-matching algorithm:
+\noindent This function iterates $der$ taking one character at
+the time from the original string until it is exhausted.
+Having $ders$ in place, we can finally define our matching
+algorithm:
\[
-matches\,s\,r = nullable(ders\,s\,r)
+matches\,s\,r \dn nullable(ders\,s\,r)
\]
\noindent
-We can claim that
+and we can claim that
\[
matches\,s\,r\quad\text{if and only if}\quad s\in L(r)
@@ -398,7 +403,10 @@
\begin{figure}[p]
\lstinputlisting{../progs/app5.scala}
\caption{Scala implementation of the nullable and
- derivatives functions.\label{scala1}}
+ derivatives functions. These functions are easy to
+ implement in functional languages, because pattern
+ matching and recursion allow us to mimic the mathematical
+ definitions very closely.\label{scala1}}
\end{figure}
For running the algorithm with our favourite example, the evil
@@ -432,8 +440,8 @@
table {re1.data};
\end{axis}
\end{tikzpicture}
\end{center}
-\noindent Analysing this failure a bit we notice that
-for $a^{\{n\}}$ we generate quite big regular expressions:
+\noindent Analysing this failure we notice that for
+$a^{\{n\}}$ we generate quite big regular expressions:
\begin{center}
\begin{tabular}{rl}
@@ -451,17 +459,18 @@
large regular expressions will cause problems. This problem
is aggravated by $a?$ being represented as $a + \epsilon$.
-We can fix this by having an explicit constructor for
+We can however fix this by having an explicit constructor for
$r^{\{n\}}$. In Scala we would introduce a constructor like
\begin{center}
\code{case class NTIMES(r: Rexp, n: Int) extends Rexp}
\end{center}
-\noindent With this we have a constant ``size'' regular
-expression for our running example no matter how large $n$
-is. This means we have to also add cases for $nullable$ and
-$der$. Does the change have any effect?
+\noindent With this fix we have a constant ``size'' regular
+expression for our running example no matter how large $n$ is.
+This means we have to also add cases for \pcode{NTIMES} in the
+functions $nullable$ and $der$. Does the change have any
+effect?
\begin{center}
\begin{tikzpicture}
\begin{axis}[
xlabel={\pcode{a}s},
ylabel={time in secs},
@@ -490,12 +499,12 @@
The moral is that our algorithm is rather sensitive to the
size of regular expressions it needs to handle. This is of
-course obvious because both $nullable$ and $der$ need to
-traverse the whole regular expression. There seems to be one
-more issue for making the algorithm run faster. The derivative
-function often produces ``useless'' $\varnothing$s and
-$\epsilon$s. To see this, consider $r = ((a \cdot b) + b)^*$
-and the following two derivatives
+course obvious because both $nullable$ and $der$ frequently
+need to traverse the whole regular expression. There seems,
+however, one more issue for making the algorithm run faster.
+The derivative function often produces ``useless''
+$\varnothing$s and $\epsilon$s. To see this, consider $r = ((a
+\cdot b) + b)^*$ and the following two derivatives
\begin{center}
\begin{tabular}{l}
@@ -527,7 +536,7 @@
and $n$-times (the latter because we added it in the second
version of our matcher). There is no rule for a star, because
empirical data and also a little thought showed that
-simplifying under a star is waste of computation time. The
+simplifying under a star is a waste of computation time. The
simplification function will be called after every derivation.
This additional step removes all the ``junk'' the derivative
function introduced. Does this improve the speed? You bet!!
@@ -535,7 +544,9 @@
\begin{figure}[p]
\lstinputlisting{../progs/app6.scala}
\caption{The simplification function and modified
-\texttt{ders}-function.\label{scala2}}
+\texttt{ders}-function; this function now
+calls \texttt{der} first, but then tries to simplify
+the resulting derivative regular expressions.\label{scala2}}
\end{figure}
\begin{center}