diff -r 1aec0e1fda86 -r 1cef3924f7a2 handouts/ho02.tex --- a/handouts/ho02.tex Sun Aug 21 18:15:53 2016 +0200 +++ b/handouts/ho02.tex Mon Aug 22 09:12:03 2016 +0200 @@ -11,23 +11,24 @@ \section*{Handout 2 (Regular Expression Matching)} -This lecture is about implementing a more efficient regular -expression matcher (the plots on the right)---more efficient -than the matchers from regular expression libraries in Ruby -and Python (the plots on the left). These plots show the -running time for the evil regular expression -$a^?{}^{\{n\}}\cdot a^{\{n\}}$ and strings composed of $n$ -\pcode{a}s. We will use this regular expression and strings as -running example. To see the substantial differences in the two -plots below, note the different scales of the $x$-axes. +This lecture is about implementing a more efficient regular expression +matcher (the plots on the right)---more efficient than the matchers +from regular expression libraries in Ruby, Python and Java (the plots +on the left). The first pair of plots show the running time for the +regular expressions $a^?{}^{\{n\}}\cdot a^{\{n\}}$ and strings composed +of $n$ \pcode{a}s. The second pair of plots show the running time +for the regular expression $(a^*)^*\cdot b$ and also strings composed +of $n$ \pcode{a}s (meaning this regular expression actually does not +match the strings). To see the substantial differences in the left +and right plots below, note the different scales of the $x$-axes. \begin{center} \begin{tabular}{@{}cc@{}} \begin{tikzpicture} \begin{axis}[ - xlabel={strings of {\tt a}s}, - ylabel={time in secs}, + xlabel={\small $a^{?\{n\}} \cdot a^{\{n\}}$ and strings $\underbrace{a\ldots a}_{n}$}, + ylabel={\small time in secs}, enlargelimits=false, xtick={0,5,...,30}, xmax=33, @@ -49,7 +50,49 @@ & \begin{tikzpicture} \begin{axis}[ - xlabel={strings of \texttt{a}s}, + xlabel={\small $a^{?\{n\}} \cdot a^{\{n\}}$ and strings $\underbrace{a\ldots a}_{n}$}, + ylabel={\small time in secs}, + enlargelimits=false, + xtick={0,3000,...,12000}, + xmax=12500, + ymax=35, + ytick={0,5,...,30}, + scaled ticks=false, + axis lines=left, + width=6.5cm, + height=5cm] +\addplot[green,mark=square*,mark options={fill=white}] table {re2b.data}; +\addplot[black,mark=square*,mark options={fill=white}] table {re3.data}; +\end{axis} +\end{tikzpicture} +\end{tabular} +\end{center} + +\begin{center} +\begin{tabular}{@{}cc@{}} +\begin{tikzpicture} +\begin{axis}[ + xlabel={$(a^*)^* \cdot b$ and strings $\underbrace{a\ldots a}_{n}$}, + ylabel={time in secs}, + enlargelimits=false, + xtick={0,5,...,30}, + xmax=33, + ymax=35, + ytick={0,5,...,30}, + scaled ticks=false, + axis lines=left, + width=5cm, + height=5cm, + legend entries={Java}, + legend pos=north west, + legend cell align=left] +\addplot[blue,mark=*, mark options={fill=white}] table {re-java.data}; +\end{axis} +\end{tikzpicture} +& +\begin{tikzpicture} + \begin{axis}[ + xlabel={$(a^*)^* \cdot b$ and strings $\underbrace{a\ldots a}_{n}$}, ylabel={time in secs}, enlargelimits=false, xtick={0,3000,...,12000}, @@ -67,8 +110,11 @@ \end{tabular} \end{center}\medskip +\noindent +We will use these regular expressions and strings +as running examples. -\noindent Having specified in the previous lecture what +Having specified in the previous lecture what problem our regular expression matcher is supposed to solve, namely for any given regular expression $r$ and string $s$ answer \textit{true} if and only if @@ -77,16 +123,15 @@ s \in L(r) \] -\noindent we can look at an algorithm to solve this problem. -Clearly we cannot use the function $L$ directly for this, -because in general the set of strings $L$ returns is infinite -(recall what $L(a^*)$ is). In such cases there is no way we -can implement an exhaustive test for whether a string is -member of this set or not. In contrast our matching algorithm -will operate on the regular expression $r$ and string $s$, -only, which are both finite objects. Before we come to the matching -algorithm, however, let us have a closer look at what it means -when two regular expressions are equivalent. +\noindent we can look at an algorithm to solve this problem. Clearly +we cannot use the function $L$ directly for this, because in general +the set of strings $L$ returns is infinite (recall what $L(a^*)$ is). +In such cases there is no way we can implement an exhaustive test for +whether a string is member of this set or not. In contrast our +matching algorithm will operate on the regular expression $r$ and +string $s$, only, which are both finite objects. Before we explain to +the matching algorithm, however, let us have a closer look at what it +means when two regular expressions are equivalent. \subsection*{Regular Expression Equivalences} @@ -156,16 +201,15 @@ \end{tabular} \end{center} -\noindent which always hold no matter what the regular -expression $r$ looks like. The first two are easy to verify -since $L(\ZERO)$ is the empty set. The next two are also -easy to verify since $L(\ONE) = \{[]\}$ and appending the -empty string to every string of another set, leaves the set -unchanged. Be careful to fully comprehend the fifth and sixth -equivalence: if you concatenate two sets of strings and one is -the empty set, then the concatenation will also be the empty -set. To see this, check the definition of $\_ @ \_$. The -last equivalence is again trivial. +\noindent which always hold no matter what the regular expression $r$ +looks like. The first two are easy to verify since $L(\ZERO)$ is the +empty set. The next two are also easy to verify since $L(\ONE) = +\{[]\}$ and appending the empty string to every string of another set, +leaves the set unchanged. Be careful to fully comprehend the fifth and +sixth equivalence: if you concatenate two sets of strings and one is +the empty set, then the concatenation will also be the empty set. To +see this, check the definition of $\_ @ \_$ for sets. The last +equivalence is again trivial. What will be important later on is that we can orient these equivalences and read them from left to right. In this way we @@ -177,14 +221,14 @@ \label{big} \end{equation} -\noindent If we can find an equivalent regular expression that -is simpler (smaller for example), then this might potentially -make our matching algorithm run faster. The reason is that -whether a string $s$ is in $L(r)$ or in $L(r')$ with $r\equiv -r'$ will always give the same answer. In the example above you -will see that the regular expression is equivalent to just $r_1$. -You can verify this by iteratively applying the simplification -rules from above: +\noindent If we can find an equivalent regular expression that is +simpler (smaller for example), then this might potentially make our +matching algorithm run faster. We can look for such a simpler regular +expression $r'$ because whether a string $s$ is in $L(r)$ or in +$L(r')$ with $r\equiv r'$ will always give the same answer. In the +example above you will see that the regular expression is equivalent +to just $r_1$. You can verify this by iteratively applying the +simplification rules from above: \begin{center} \begin{tabular}{ll} @@ -208,19 +252,19 @@ \subsection*{The Matching Algorithm} The algorithm we will define below consists of two parts. One -is the function $nullable$ which takes a regular expression as +is the function $\textit{nullable}$ which takes a regular expression as argument and decides whether it can match the empty string (this means it returns a boolean in Scala). This can be easily defined recursively as follows: \begin{center} \begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}} -$nullable(\ZERO)$ & $\dn$ & $\textit{false}$\\ -$nullable(\ONE)$ & $\dn$ & $true$\\ -$nullable(c)$ & $\dn$ & $\textit{false}$\\ -$nullable(r_1 + r_2)$ & $\dn$ & $nullable(r_1) \vee nullable(r_2)$\\ -$nullable(r_1 \cdot r_2)$ & $\dn$ & $nullable(r_1) \wedge nullable(r_2)$\\ -$nullable(r^*)$ & $\dn$ & $true$ \\ +$\textit{nullable}(\ZERO)$ & $\dn$ & $\textit{false}$\\ +$\textit{nullable}(\ONE)$ & $\dn$ & $\textit{true}$\\ +$\textit{nullable}(c)$ & $\dn$ & $\textit{false}$\\ +$\textit{nullable}(r_1 + r_2)$ & $\dn$ & $\textit{nullable}(r_1) \vee \textit{nullable}(r_2)$\\ +$\textit{nullable}(r_1 \cdot r_2)$ & $\dn$ & $\textit{nullable}(r_1) \wedge \textit{nullable}(r_2)$\\ +$\textit{nullable}(r^*)$ & $\dn$ & $\textit{true}$ \\ \end{tabular} \end{center} @@ -228,7 +272,7 @@ property holds: \[ -nullable(r) \;\;\text{if and only if}\;\; []\in L(r) +\textit{nullable}(r) \;\;\text{if and only if}\;\; []\in L(r) \] \noindent Note on the left-hand side of the if-and-only-if we @@ -239,7 +283,7 @@ The other function of our matching algorithm calculates a \emph{derivative} of a regular expression. This is a function which will take a regular expression, say $r$, and a -character, say $c$, as argument and returns a new regular +character, say $c$, as arguments and returns a new regular expression. Be careful that the intuition behind this function is not so easy to grasp on first reading. Essentially this function solves the following problem: if $r$ can match a @@ -253,7 +297,7 @@ $der\, c\, (\ONE)$ & $\dn$ & $\ZERO$ \\ $der\, c\, (d)$ & $\dn$ & if $c = d$ then $\ONE$ else $\ZERO$\\ $der\, c\, (r_1 + r_2)$ & $\dn$ & $der\, c\, r_1 + der\, c\, r_2$\\ - $der\, c\, (r_1 \cdot r_2)$ & $\dn$ & if $nullable (r_1)$\\ + $der\, c\, (r_1 \cdot r_2)$ & $\dn$ & if $\textit{nullable} (r_1)$\\ & & then $(der\,c\,r_1) \cdot r_2 + der\, c\, r_2$\\ & & else $(der\, c\, r_1) \cdot r_2$\\ $der\, c\, (r^*)$ & $\dn$ & $(der\,c\,r) \cdot (r^*)$ @@ -278,7 +322,9 @@ matched by the regular expression $r_1$ or $r_2$. So we just have to recursively call $der$ with these two regular expressions and compose the results again with $+$. Makes -sense? The $\cdot$-case is more complicated: if $r_1\cdot r_2$ +sense? + +The $\cdot$-case is more complicated: if $r_1\cdot r_2$ matches a string of the form $c\!::\!s$, then the first part must be matched by $r_1$. Consequently, it makes sense to construct the regular expression for $s$ by calling $der$ with @@ -294,7 +340,7 @@ single copy of $r$. Therefore we call recursively $der\,c\,r$ and ``append'' $r^*$ in order to match the rest of $s$. -If this did not make sense, here is another way to rationalise +If this did not make sense yet, here is another way to rationalise the definition of $der$ by considering the following operation on sets: @@ -338,10 +384,10 @@ Step 1: & build derivative of $a$ and $r_1$ & $(r_2 = der\,a\,r_1)$\smallskip\\ Step 2: & build derivative of $b$ and $r_2$ & $(r_3 = der\,b\,r_2)$\smallskip\\ Step 3: & build derivative of $c$ and $r_3$ & $(r_4 = der\,b\,r_3)$\smallskip\\ -Step 4: & the string is exhausted; test & ($nullable(r_4)$)\\ +Step 4: & the string is exhausted; test & ($\textit{nullable}(r_4)$)\\ & whether $r_4$ can recognise the\\ & empty string\smallskip\\ -Output: & result of this test $\Rightarrow true \,\text{or}\, \textit{false}$\\ +Output: & result of this test $\Rightarrow \textit{true} \,\text{or}\, \textit{false}$\\ \end{tabular} \end{center} @@ -350,17 +396,18 @@ do not know yet---but let us assume it is. Then $Der\,a\,L(r_1)$ builds the set where all the strings not starting with $a$ are filtered out. Of the remaining strings, the $a$ is stripped -off. Then we continue with filtering out all strings not +off. So we should still have $bc$ in the set. +Then we continue with filtering out all strings not starting with $b$ and stripping off the $b$ from the remaining strings, that means we build $Der\,b\,(Der\,a\,(L(r_1)))$. Finally we filter out all strings not starting with $c$ and strip off $c$ from the remaining string. This is $Der\,c\,(Der\,b\,(Der\,a\,(L(r))))$. Now if $abc$ was in the original set ($L(r_1)$), then in $Der\,c\,(Der\,b\,(Der\,a\,(L(r))))$ -must be the empty string. If not, then $abc$ was not in the +must contain the empty string. If not, then $abc$ was not in the language we started with. -Our matching algorithm using $der$ and $nullable$ works +Our matching algorithm using $der$ and $\textit{nullable}$ works similarly, just using regular expression instead of sets. For this we need to extend the notion of derivatives from single characters to strings. This can be done using the following @@ -380,7 +427,7 @@ algorithm: \[ -matches\,s\,r \dn nullable(ders\,s\,r) +matches\,s\,r \dn \textit{nullable}(ders\,s\,r) \] \noindent @@ -411,9 +458,9 @@ \begin{figure}[p] \lstinputlisting{../progs/app5.scala} -\caption{Scala implementation of the nullable and +\caption{Scala implementation of the \textit{nullable} and derivative functions. These functions are easy to - implement in functional languages, because pattern + implement in functional languages, because their built-in pattern matching and recursion allow us to mimic the mathematical definitions very closely.\label{scala1}} \end{figure} @@ -478,7 +525,7 @@ \noindent With this fix we have a constant ``size'' regular expression for our running example no matter how large $n$ is. This means we have to also add cases for \pcode{NTIMES} in the -functions $nullable$ and $der$. Does the change have any +functions $\textit{nullable}$ and $der$. Does the change have any effect? \begin{center} @@ -508,9 +555,11 @@ (and our first version) could only handle $n = 27$ or so in 30 second. +SECOND EXAMPLE + The moral is that our algorithm is rather sensitive to the size of regular expressions it needs to handle. This is of -course obvious because both $nullable$ and $der$ frequently +course obvious because both $\textit{nullable}$ and $der$ frequently need to traverse the whole regular expression. There seems, however, one more issue for making the algorithm run faster. The derivative function often produces ``useless'' @@ -580,6 +629,8 @@ \end{tikzpicture} \end{center} +SECOND EXAMPLE + \section*{Proofs} You might not like doing proofs. But they serve a very @@ -625,7 +676,7 @@ property: \begin{equation} -nullable(r) \;\;\text{if and only if}\;\; []\in L(r) +\textit{nullable}(r) \;\;\text{if and only if}\;\; []\in L(r) \label{nullableprop} \end{equation} @@ -635,12 +686,12 @@ above). So we have to show that \[ -nullable(\ZERO) \;\;\text{if and only if}\;\; +\textit{nullable}(\ZERO) \;\;\text{if and only if}\;\; []\in L(\ZERO) \] -\noindent whereby $nullable(\ZERO)$ is by definition of -the function $nullable$ always $\textit{false}$. We also have +\noindent whereby $\textit{nullable}(\ZERO)$ is by definition of +the function $\textit{nullable}$ always $\textit{false}$. We also have that $L(\ZERO)$ is by definition $\{\}$. It is impossible that the empty string $[]$ is in the empty set. Therefore also the right-hand side is false. Consequently we @@ -652,7 +703,7 @@ $P(r_1 + r_2)$, which is \begin{equation} -nullable(r_1 + r_2) \;\;\text{if and only if}\;\; +\textit{nullable}(r_1 + r_2) \;\;\text{if and only if}\;\; []\in L(r_1 + r_2) \label{propalt} \end{equation} @@ -662,17 +713,17 @@ \begin{center} \begin{tabular}{l} -$nullable(r_1) \;\;\text{if and only if}\;\; []\in L(r_1)$ and\\ -$nullable(r_2) \;\;\text{if and only if}\;\; []\in L(r_2)$\\ +$\textit{nullable}(r_1) \;\;\text{if and only if}\;\; []\in L(r_1)$ and\\ +$\textit{nullable}(r_2) \;\;\text{if and only if}\;\; []\in L(r_2)$\\ \end{tabular} \end{center} \noindent These are the induction hypotheses. To check this -case, we can start from $nullable(r_1 + r_2)$, which by +case, we can start from $\textit{nullable}(r_1 + r_2)$, which by definition is \[ -nullable(r_1) \vee nullable(r_2) +\textit{nullable}(r_1) \vee \textit{nullable}(r_2) \] \noindent Using the two induction hypotheses from above, @@ -682,7 +733,7 @@ [] \in L(r_1) \vee []\in(r_2) \] -\noindent We just replaced the $nullable(\ldots)$ parts by +\noindent We just replaced the $\textit{nullable}(\ldots)$ parts by the equivalent $[] \in L(\ldots)$ from the induction hypotheses. A bit of thinking convinces you that if $[] \in L(r_1) \vee []\in L(r_2)$ then the empty string @@ -695,7 +746,7 @@ \noindent but this is by definition of $L$ exactly $[] \in L(r_1 + r_2)$, which we needed to establish according to \eqref{propalt}. What we have shown is that starting from -$nullable(r_1 + r_2)$ we have done equivalent transformations +$\textit{nullable}(r_1 + r_2)$ we have done equivalent transformations to end up with $[] \in L(r_1 + r_2)$. Consequently we have established that $P(r_1 + r_2)$ holds. @@ -769,12 +820,12 @@ \end{equation} \noindent We have also shown that testing whether the empty -string is in a language is equivalent to the $nullable$ +string is in a language is equivalent to the $\textit{nullable}$ function; see \eqref{nullableprop}. That means \eqref{prefinalstep} is equivalent with \[ -nullable(ders\,s\,r) +\textit{nullable}(ders\,s\,r) \] \noindent But this is just the definition of $matches$