diff -r 6cb517754d8a -r 794c599cee53 handouts/ho02.tex --- a/handouts/ho02.tex Fri Apr 17 04:56:30 2015 +0100 +++ b/handouts/ho02.tex Fri May 01 19:19:31 2015 +0100 @@ -11,10 +11,13 @@ This lecture is about implementing a more efficient regular expression matcher (the plots on the right)---more efficient -than the matchers from regular expression libraries in Ruby and -Python (the plots on the left). These plots show the running -time for the evil regular expression $a?^{\{n\}}a^{\{n\}}$. -Note the different scales of the $x$-axes. +than the matchers from regular expression libraries in Ruby +and Python (the plots on the left). These plots show the +running time for the evil regular expression +$a?^{\{n\}}a^{\{n\}}$ and string composed of $n$ \pcode{a}s. +We will use this regular expression and strings as running +example. To see the substantial differences in the two plots +below, note the different scales of the $x$-axes. \begin{center} @@ -60,10 +63,9 @@ \noindent Having specified in the previous lecture what -problem our regular expression matcher, which we will call -\pcode{matches}, is supposed to solve, namely for any given -regular expression $r$ and string $s$ answer \textit{true} if -and only if +problem our regular expression matcher is supposed to solve, +namely for any given regular expression $r$ and string $s$ +answer \textit{true} if and only if \[ s \in L(r) @@ -75,8 +77,8 @@ (recall what $L(a^*)$ is). In such cases there is no way we can implement an exhaustive test for whether a string is member of this set or not. In contrast our matching algorithm -will mainly operate on the regular expression $r$ and string -$s$, which are both finite. Before we come to the matching +will operate on the regular expression $r$ and string $s$, +only, which are both finite. Before we come to the matching algorithm, however, let us have a closer look at what it means when two regular expressions are equivalent. @@ -149,19 +151,19 @@ \end{center} \noindent which always hold no matter what the regular -expression $r$ looks like. The first two are easy to verify since -$L(\varnothing)$ is the empty set. The next two are also easy -to verify since $L(\epsilon) = \{[]\}$ and appending the empty -string to every string of another set, leaves the set -unchanged. Be careful to fully comprehend the fifth and -sixth equivalence: if you concatenate two sets of strings -and one is the empty set, then the concatenation will also be -the empty set. Check the definition of \pcode{_ @ _}. -The last equivalence is again trivial. +expression $r$ looks like. The first two are easy to verify +since $L(\varnothing)$ is the empty set. The next two are also +easy to verify since $L(\epsilon) = \{[]\}$ and appending the +empty string to every string of another set, leaves the set +unchanged. Be careful to fully comprehend the fifth and sixth +equivalence: if you concatenate two sets of strings and one is +the empty set, then the concatenation will also be the empty +set. To see this, check the definition of \pcode{_ @ _}. The +last equivalence is again trivial. What will be important later on is that we can orient these equivalences and read them from left to right. In this way we -can view them as \emph{simplification rules}. Suppose for +can view them as \emph{simplification rules}. Consider for example the regular expression \begin{equation} @@ -170,12 +172,13 @@ \end{equation} \noindent If we can find an equivalent regular expression that -is simpler (smaller for example), then this might potentially -make our matching algorithm run faster. The reason is that -whether a string $s$ is in $L(r)$ or in $L(r')$ with $r\equiv r'$ -will always give the same answer. In the example above you -will see that the regular expression is equivalent to $r_1$ -if you iteratively apply the simplification rules from above: +is simpler (smaller for example), then this might potentially +make our matching algorithm run faster. The reason is that +whether a string $s$ is in $L(r)$ or in $L(r')$ with $r\equiv +r'$ will always give the same answer. In the example above you +will see that the regular expression is equivalent to $r_1$. +You can verify this by iteratively applying the simplification +rules from above: \begin{center} \begin{tabular}{ll} @@ -222,9 +225,10 @@ nullable(r) \;\;\text{if and only if}\;\; []\in L(r) \] -\noindent Note on the left-hand side we have a function we can -implement; on the right we have its specification (which we -cannot implement in a programming language). +\noindent Note on the left-hand side of the if-and-only-if we +have a function we can implement; on the right we have its +specification (which we cannot implement in a programming +language). The other function of our matching algorithm calculates a \emph{derivative} of a regular expression. This is a function @@ -234,7 +238,7 @@ is not so easy to grasp on first reading. Essentially this function solves the following problem: if $r$ can match a string of the form $c\!::\!s$, what does the regular -expression look like that can match just $s$. The definition +expression look like that can match just $s$? The definition of this function is as follows: \begin{center} @@ -252,16 +256,18 @@ \noindent The first two clauses can be rationalised as follows: recall that $der$ should calculate a regular -expression, if the ``input'' regular expression can match a -string of the form $c\!::\!s$. Since neither $\varnothing$ nor -$\epsilon$ can match such a string we return $\varnothing$. In -the third case we have to make a case-distinction: In case the -regular expression is $c$, then clearly it can recognise a -string of the form $c\!::\!s$, just that $s$ is the empty -string. Therefore we return the $\epsilon$-regular expression. -In the other case we again return $\varnothing$ since no -string of the $c\!::\!s$ can be matched. Next come the -recursive cases. Fortunately, the $+$-case is still relatively +expression so that given the ``input'' regular expression can +match a string of the form $c\!::\!s$, we want a regular +expression for $s$. Since neither $\varnothing$ nor $\epsilon$ +can match a string of the form $c\!::\!s$, we return +$\varnothing$. In the third case we have to make a +case-distinction: In case the regular expression is $c$, then +clearly it can recognise a string of the form $c\!::\!s$, just +that $s$ is the empty string. Therefore we return the +$\epsilon$-regular expression. In the other case we again +return $\varnothing$ since no string of the $c\!::\!s$ can be +matched. Next come the recursive cases, which are a bit more +involved. Fortunately, the $+$-case is still relatively straightforward: all strings of the form $c\!::\!s$ are either matched by the regular expression $r_1$ or $r_2$. So we just have to recursively call $der$ with these two regular @@ -275,13 +281,12 @@ all of $c\!::\!s$ is matched by $r_2$. So in case $r_1$ is nullable (that is can match the empty string) we have to allow the choice $der\,c\,r_2$ for calculating the regular -expression that can match $s$. Therefore we have to -add the regular expression $der\,c\,r_2$. -The $*$-case is again simple: -if $r^*$ matches a string of the form $c\!::\!s$, then the -first part must be ``matched'' by a single copy of $r$. -Therefore we call recursively $der\,c\,r$ and ``append'' $r^*$ -in order to match the rest of $s$. +expression that can match $s$. Therefore we have to add the +regular expression $der\,c\,r_2$ in the result. The $*$-case +is again simple: if $r^*$ matches a string of the form +$c\!::\!s$, then the first part must be ``matched'' by a +single copy of $r$. Therefore we call recursively $der\,c\,r$ +and ``append'' $r^*$ in order to match the rest of $s$. If this did not make sense, here is another way to rationalise the definition of $der$ by considering the following operation @@ -333,7 +338,7 @@ \noindent Again the operation $Der$ might help to rationalise this algorithm. We want to know whether $abc \in L(r_1)$. We -do not know yet. But let us assume it is. Then $Der\,a\,L(r_1)$ +do not know yet---but let us assume it is. Then $Der\,a\,L(r_1)$ builds the set where all the strings not starting with $a$ are filtered out. Of the remaining strings, the $a$ is stripped off. Then we continue with filtering out all strings not @@ -343,12 +348,12 @@ strip off $c$ from the remaining string. This is $Der\,c\,(Der\,b\,(Der\,a\,(L(r))))$. Now if $abc$ was in the original set ($L(r_1)$), then in $Der\,c\,(Der\,b\,(Der\,a\,(L(r))))$ -must be the empty string. If not then $abc$ was not in the +must be the empty string. If not, then $abc$ was not in the language we started with. Our matching algorithm using $der$ and $nullable$ works similarly, just using regular expression instead of sets. For -this we need to extend the notion of derivatives from +this we need to extend the notion of derivatives from single characters to strings. This can be done using the following function, taking a string and regular expression as input and a regular expression as output. @@ -360,17 +365,17 @@ \end{tabular} \end{center} -\noindent This function essentially iterates $der$ taking one -character at the time from the original string until it is -exhausted. Having $ders$ in place, we can finally define our -matching algorithm: +\noindent This function iterates $der$ taking one character at +the time from the original string until it is exhausted. +Having $ders$ in place, we can finally define our matching +algorithm: \[ -matches\,s\,r = nullable(ders\,s\,r) +matches\,s\,r \dn nullable(ders\,s\,r) \] \noindent -We can claim that +and we can claim that \[ matches\,s\,r\quad\text{if and only if}\quad s\in L(r) @@ -398,7 +403,10 @@ \begin{figure}[p] \lstinputlisting{../progs/app5.scala} \caption{Scala implementation of the nullable and - derivatives functions.\label{scala1}} + derivatives functions. These functions are easy to + implement in functional languages, because pattern + matching and recursion allow us to mimic the mathematical + definitions very closely.\label{scala1}} \end{figure} For running the algorithm with our favourite example, the evil @@ -432,8 +440,8 @@ table {re1.data}; \end{axis} \end{tikzpicture} \end{center} -\noindent Analysing this failure a bit we notice that -for $a^{\{n\}}$ we generate quite big regular expressions: +\noindent Analysing this failure we notice that for +$a^{\{n\}}$ we generate quite big regular expressions: \begin{center} \begin{tabular}{rl} @@ -451,17 +459,18 @@ large regular expressions will cause problems. This problem is aggravated by $a?$ being represented as $a + \epsilon$. -We can fix this by having an explicit constructor for +We can however fix this by having an explicit constructor for $r^{\{n\}}$. In Scala we would introduce a constructor like \begin{center} \code{case class NTIMES(r: Rexp, n: Int) extends Rexp} \end{center} -\noindent With this we have a constant ``size'' regular -expression for our running example no matter how large $n$ -is. This means we have to also add cases for $nullable$ and -$der$. Does the change have any effect? +\noindent With this fix we have a constant ``size'' regular +expression for our running example no matter how large $n$ is. +This means we have to also add cases for \pcode{NTIMES} in the +functions $nullable$ and $der$. Does the change have any +effect? \begin{center} \begin{tikzpicture} \begin{axis}[ xlabel={\pcode{a}s}, ylabel={time in secs}, @@ -490,12 +499,12 @@ The moral is that our algorithm is rather sensitive to the size of regular expressions it needs to handle. This is of -course obvious because both $nullable$ and $der$ need to -traverse the whole regular expression. There seems to be one -more issue for making the algorithm run faster. The derivative -function often produces ``useless'' $\varnothing$s and -$\epsilon$s. To see this, consider $r = ((a \cdot b) + b)^*$ -and the following two derivatives +course obvious because both $nullable$ and $der$ frequently +need to traverse the whole regular expression. There seems, +however, one more issue for making the algorithm run faster. +The derivative function often produces ``useless'' +$\varnothing$s and $\epsilon$s. To see this, consider $r = ((a +\cdot b) + b)^*$ and the following two derivatives \begin{center} \begin{tabular}{l} @@ -527,7 +536,7 @@ and $n$-times (the latter because we added it in the second version of our matcher). There is no rule for a star, because empirical data and also a little thought showed that -simplifying under a star is waste of computation time. The +simplifying under a star is a waste of computation time. The simplification function will be called after every derivation. This additional step removes all the ``junk'' the derivative function introduced. Does this improve the speed? You bet!! @@ -535,7 +544,9 @@ \begin{figure}[p] \lstinputlisting{../progs/app6.scala} \caption{The simplification function and modified -\texttt{ders}-function.\label{scala2}} +\texttt{ders}-function; this function now +calls \texttt{der} first, but then tries to simplify +the resulting derivative regular expressions.\label{scala2}} \end{figure} \begin{center}