diff -r 65d1ea0e989f -r 24531cfaa36a handouts/ho02.tex --- a/handouts/ho02.tex Sat Sep 27 00:37:02 2014 +0100 +++ b/handouts/ho02.tex Sun Sep 28 18:07:58 2014 +0100 @@ -1,15 +1,77 @@ \documentclass{article} \usepackage{../style} \usepackage{../langs} - +\usepackage{../graphics} +\usepackage{../data} \begin{document} \section*{Handout 2} -Having specified what problem our matching algorithm, \pcode{match}, -is supposed to solve, namely for a given regular expression $r$ and -string $s$ answer \textit{true} if and only if +This lecture is about implementing a more efficient regular +expression matcher (the plots on the right)---more efficient +than the matchers from regular expression libraries in Ruby and +Python (the plots on the left). These plots show the running +time for the evil regular expression $a?^{\{n\}}a^{\{n\}}$. + +\begin{center} +\begin{tabular}{@{}cc@{}} +\begin{tikzpicture}[y=.072cm, x=.12cm] + %axis + \draw (0,0) -- coordinate (x axis mid) (30,0); + \draw (0,0) -- coordinate (y axis mid) (0,30); + %ticks + \foreach \x in {0,5,...,30} + \draw (\x,1pt) -- (\x,-3pt) node[anchor=north] {\x}; + \foreach \y in {0,5,...,30} + \draw (1pt,\y) -- (-3pt,\y) node[anchor=east] {\y}; + %labels + \node[below=0.6cm] at (x axis mid) {number of \texttt{a}s}; + \node[rotate=90,left=0.9cm] at (y axis mid) {time in secs}; + %plots + \draw[color=blue] plot[mark=*] + file {re-python.data}; + \draw[color=brown] plot[mark=triangle*] + file {re-ruby.data}; + %legend + \begin{scope}[shift={(4,20)}] + \draw[color=blue] (0,0) -- + plot[mark=*] (0.25,0) -- (0.5,0) + node[right]{\small Python}; + \draw[yshift=-4mm, color=brown] (0,0) -- + plot[mark=triangle*] (0.25,0) -- (0.5,0) + node[right]{\small Ruby}; + \end{scope} +\end{tikzpicture} +& +\begin{tikzpicture}[y=.072cm, x=.0004cm] + %axis + \draw (0,0) -- coordinate (x axis mid) (12000,0); + \draw (0,0) -- coordinate (y axis mid) (0,30); + %ticks + \foreach \x in {0,3000,...,12000} + \draw (\x,1pt) -- (\x,-3pt) node[anchor=north] {\x}; + \foreach \y in {0,5,...,30} + \draw (1pt,\y) -- (-3pt,\y) node[anchor=east] {\y}; + %labels + \node[below=0.6cm] at (x axis mid) {number of \texttt{a}s}; + \node[rotate=90,left=0.9cm] at (y axis mid) {time in secs}; + + %plots + \draw[color=green] plot[mark=square*, mark options={fill=white} ] + file {re2b.data}; + \draw[color=black] plot[mark=square*, mark options={fill=white} ] + file {re3.data}; +\end{tikzpicture} +\end{tabular} +\end{center}\medskip + + +\noindent Having specified in the previous lecture what +problem our regular expression matcher, which we will call +\pcode{matches}, is supposed to solve, namely for any given +regular expression $r$ and string $s$ answer \textit{true} if +and only if \[ s \in L(r) @@ -20,95 +82,219 @@ because in general the set of strings $L$ returns is infinite (recall what $L(a^*)$ is). In such cases there is no way we can implement an exhaustive test for whether a string is -member of this set or not. Before we come to the matching -algorithm, lets have a closer look at what it means when -two regular expressions are equivalent. +member of this set or not. In contrast our matching algorithm +will mainly operate on the regular expression $r$ and string +$s$, which are both finite. Before we come to the matching +algorithm, however, let us have a closer look at what it means +when two regular expressions are equivalent. \subsection*{Regular Expression Equivalences} - -\subsection*{Matching Algorithm} +We already defined in Handout 1 what it means for two regular +expressions to be equivalent, namely if their meaning is the +same language: -The algorithm we will define below consists of two parts. One is the -function $nullable$ which takes a regular expression as argument and -decides whether it can match the empty string (this means it returns a -boolean). This can be easily defined recursively as follows: +\[ +r_1 \equiv r_2 \;\dn\; L(r_1) = L(r_2) +\] + +\noindent +It is relatively easy to verify that some concrete equivalences +hold, for example \begin{center} -\begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}} -$nullable(\varnothing)$ & $\dn$ & $f\!\/alse$\\ -$nullable(\epsilon)$ & $\dn$ & $true$\\ -$nullable(c)$ & $\dn$ & $f\!alse$\\ -$nullable(r_1 + r_2)$ & $\dn$ & $nullable(r_1) \vee nullable(r_2)$\\ -$nullable(r_1 \cdot r_2)$ & $\dn$ & $nullable(r_1) \wedge nullable(r_2)$\\ -$nullable(r^*)$ & $\dn$ & $true$ \\ +\begin{tabular}{rcl} +$(a + b) + c$ & $\equiv$ & $a + (b + c)$\\ +$a + a$ & $\equiv$ & $a$\\ +$a + b$ & $\equiv$ & $b + a$\\ +$(a \cdot b) \cdot c$ & $\equiv$ & $a \cdot (b \cdot c)$\\ +$c \cdot (a + b)$ & $\equiv$ & $(c \cdot a) + (c \cdot b)$\\ \end{tabular} \end{center} \noindent -The idea behind this function is that the following property holds: +but also easy to verify that the following regular expressions +are \emph{not} equivalent + +\begin{center} +\begin{tabular}{rcl} +$a \cdot a$ & $\not\equiv$ & $a$\\ +$a + (b \cdot c)$ & $\not\equiv$ & $(a + b) \cdot (a + c)$\\ +\end{tabular} +\end{center} + +\noindent I leave it to you to verify these equivalences and +non-equivalences. It is also interesting to look at some +corner cases involving $\epsilon$ and $\varnothing$: + +\begin{center} +\begin{tabular}{rcl} +$a \cdot \varnothing$ & $\not\equiv$ & $a$\\ +$a + \epsilon$ & $\not\equiv$ & $a$\\ +$\epsilon$ & $\equiv$ & $\varnothing^*$\\ +$\epsilon^*$ & $\equiv$ & $\epsilon$\\ +$\varnothing^*$ & $\not\equiv$ & $\varnothing$\\ +\end{tabular} +\end{center} + +\noindent Again I leave it to you to make sure you agree +with these equivalences and non-equivalences. + + +For our matching algorithm however the following six +equivalences will play an important role: + +\begin{center} +\begin{tabular}{rcl} +$r + \varnothing$ & $\equiv$ & $r$\\ +$\varnothing + r$ & $\equiv$ & $r$\\ +$r \cdot \epsilon$ & $\equiv$ & $r$\\ +$\epsilon \cdot r$ & $\equiv$ & $r$\\ +$r \cdot \varnothing$ & $\equiv$ & $\varnothing$\\ +$\varnothing \cdot r$ & $\equiv$ & $\varnothing$\\ +$r + r$ & $\equiv$ & $r$ +\end{tabular} +\end{center} + +\noindent which always hold no matter what the regular +expression $r$ looks like. The first are easy to verify since +$L(\varnothing)$ is the empty set. The next two are also easy +to verify since $L(\epsilon) = \{[]\}$ and appending the empty +string to every string of another set, leaves the set +unchanged. Be careful to fully comprehend the fifth and +sixth equivalence: if you concatenate two sets of strings +and one is the empty set, then the concatenation will also be +the empty set. Check the definition of \pcode{_ @ _}. +The last equivalence is again trivial. + + +What will be important later on is that we can orient these +equivalences and read them from left to right. In this way we +can view them as \emph{simplification rules}. Suppose for +example the regular expression + +\begin{equation} +(r_1 + \varnothing) \cdot \epsilon + ((\epsilon + r_2) + r_3) \cdot (r_4 \cdot \varnothing) +\label{big} +\end{equation} + +\noindent If we can find an equivalent regular expression that +is simpler (smaller for example), then this might potentially +make our matching algorithm is faster. The reason is that +whether a string $s$ is in $L(r)$ or in $L(r')$ with $r\equiv r'$ +will always give the same answer. In the example above you +will see that the regular expression is equivalent to $r_1$ +if you iteratively apply the simplification rules from above: + +\begin{center} +\begin{tabular}{ll} + & $(r_1 + \varnothing) \cdot \epsilon + ((\epsilon + r_2) + r_3) \cdot +(\underline{r_4 \cdot \varnothing})$\smallskip\\ +$\equiv$ & $(r_1 + \varnothing) \cdot \epsilon + \underline{((\epsilon + r_2) + r_3) \cdot +\varnothing}$\smallskip\\ +$\equiv$ & $\underline{(r_1 + \varnothing) \cdot \epsilon} + \varnothing$\smallskip\\ +$\equiv$ & $(\underline{r_1 + \varnothing}) + \varnothing$\smallskip\\ +$\equiv$ & $\underline{r_1 + \varnothing}$\smallskip\\ +$\equiv$ & $r_1$\ +\end{tabular} +\end{center} + +\noindent In each step I underlined where a simplification +rule is applied. Our matching algorithm in the next section +will often generate such ``useless'' $\epsilon$s and +$\varnothing$s, therefore simplifying them away will make the +algorithm quite a bit faster. + +\subsection*{The Matching Algorithm} + +The algorithm we will define below consists of two parts. One +is the function $nullable$ which takes a regular expression as +argument and decides whether it can match the empty string +(this means it returns a boolean in Scala). This can be easily +defined recursively as follows: + +\begin{center} +\begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}} +$nullable(\varnothing)$ & $\dn$ & $\textit{false}$\\ +$nullable(\epsilon)$ & $\dn$ & $true$\\ +$nullable(c)$ & $\dn$ & $\textit{false}$\\ +$nullable(r_1 + r_2)$ & $\dn$ & $nullable(r_1) \vee nullable(r_2)$\\ +$nullable(r_1 \cdot r_2)$ & $\dn$ & $nullable(r_1) \wedge nullable(r_2)$\\ +$nullable(r^*)$ & $\dn$ & $true$ \\ +\end{tabular} +\end{center} + +\noindent The idea behind this function is that the following +property holds: \[ -nullable(r) \;\;\text{if and only if}\;\; ""\in L(r) +nullable(r) \;\;\text{if and only if}\;\; []\in L(r) \] -\noindent -Note on the left-hand side we have a function we can implement; on the -right we have its specification. +\noindent Note on the left-hand side we have a function we can +implement; on the right we have its specification (which we +cannot implement in a programming language). The other function of our matching algorithm calculates a -\emph{derivative} of a regular expression. This is a function which -will take a regular expression, say $r$, and a character, say $c$, as -argument and return a new regular expression. Be careful that the -intuition behind this function is not so easy to grasp on first -reading. Essentially this function solves the following problem: if -$r$ can match a string of the form $c\!::\!s$, what does the regular -expression look like that can match just $s$. The definition of this -function is as follows: +\emph{derivative} of a regular expression. This is a function +which will take a regular expression, say $r$, and a +character, say $c$, as argument and return a new regular +expression. Be careful that the intuition behind this function +is not so easy to grasp on first reading. Essentially this +function solves the following problem: if $r$ can match a +string of the form $c\!::\!s$, what does the regular +expression look like that can match just $s$. The definition +of this function is as follows: \begin{center} -\begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {\hspace{-10mm}}l@ {}} - $der\, c\, (\varnothing)$ & $\dn$ & $\varnothing$ & \\ - $der\, c\, (\epsilon)$ & $\dn$ & $\varnothing$ & \\ - $der\, c\, (d)$ & $\dn$ & if $c = d$ then $\epsilon$ else $\varnothing$ & \\ - $der\, c\, (r_1 + r_2)$ & $\dn$ & $der\, c\, r_1 + der\, c\, r_2$ & \\ +\begin{tabular}{l@ {\hspace{2mm}}c@ {\hspace{2mm}}l} + $der\, c\, (\varnothing)$ & $\dn$ & $\varnothing$\\ + $der\, c\, (\epsilon)$ & $\dn$ & $\varnothing$ \\ + $der\, c\, (d)$ & $\dn$ & if $c = d$ then $\epsilon$ else $\varnothing$\\ + $der\, c\, (r_1 + r_2)$ & $\dn$ & $der\, c\, r_1 + der\, c\, r_2$\\ $der\, c\, (r_1 \cdot r_2)$ & $\dn$ & if $nullable (r_1)$\\ & & then $(der\,c\,r_1) \cdot r_2 + der\, c\, r_2$\\ & & else $(der\, c\, r_1) \cdot r_2$\\ - $der\, c\, (r^*)$ & $\dn$ & $(der\,c\,r) \cdot (r^*)$ & + $der\, c\, (r^*)$ & $\dn$ & $(der\,c\,r) \cdot (r^*)$ \end{tabular} \end{center} -\noindent -The first two clauses can be rationalised as follows: recall that -$der$ should calculate a regular expression, if the ``input'' regular -expression can match a string of the form $c\!::\!s$. Since neither -$\varnothing$ nor $\epsilon$ can match such a string we return -$\varnothing$. In the third case we have to make a case-distinction: -In case the regular expression is $c$, then clearly it can recognise a +\noindent The first two clauses can be rationalised as +follows: recall that $der$ should calculate a regular +expression, if the ``input'' regular expression can match a +string of the form $c\!::\!s$. Since neither $\varnothing$ nor +$\epsilon$ can match such a string we return $\varnothing$. In +the third case we have to make a case-distinction: In case the +regular expression is $c$, then clearly it can recognise a string of the form $c\!::\!s$, just that $s$ is the empty -string. Therefore we return the $\epsilon$-regular expression. In the -other case we again return $\varnothing$ since no string of the -$c\!::\!s$ can be matched. The $+$-case is relatively -straightforward: all strings of the form $c\!::\!s$ are either matched -by the regular expression $r_1$ or $r_2$. So we just have to -recursively call $der$ with these two regular expressions and compose -the results again with $+$. The $\cdot$-case is more complicated: if -$r_1\cdot r_2$ matches a string of the form $c\!::\!s$, then the first -part must be matched by $r_1$. Consequently, it makes sense to -construct the regular expression for $s$ by calling $der$ with $r_1$ -and ``appending'' $r_2$. There is however one exception to this simple -rule: if $r_1$ can match the empty string, then all of $c\!::\!s$ is -matched by $r_2$. So in case $r_1$ is nullable (that is can match the -empty string) we have to allow the choice $der\,c\,r_2$ for -calculating the regular expression that can match $s$. The $*$-case is -again simple: if $r^*$ matches a string of the form $c\!::\!s$, then -the first part must be ``matched'' by a single copy of $r$. Therefore -we call recursively $der\,c\,r$ and ``append'' $r^*$ in order to match -the rest of $s$. +string. Therefore we return the $\epsilon$-regular expression. +In the other case we again return $\varnothing$ since no +string of the $c\!::\!s$ can be matched. Next come the +recursive cases. Fortunately, the $+$-case is still relatively +straightforward: all strings of the form $c\!::\!s$ are either +matched by the regular expression $r_1$ or $r_2$. So we just +have to recursively call $der$ with these two regular +expressions and compose the results again with $+$. Yes, makes +sense? The $\cdot$-case is more complicated: if $r_1\cdot r_2$ +matches a string of the form $c\!::\!s$, then the first part +must be matched by $r_1$. Consequently, it makes sense to +construct the regular expression for $s$ by calling $der$ with +$r_1$ and ``appending'' $r_2$. There is however one exception +to this simple rule: if $r_1$ can match the empty string, then +all of $c\!::\!s$ is matched by $r_2$. So in case $r_1$ is +nullable (that is can match the empty string) we have to allow +the choice $der\,c\,r_2$ for calculating the regular +expression that can match $s$. Therefore we have to +add the regular expression $der\,c\,r_2$. +The $*$-case is again simple: +if $r^*$ matches a string of the form $c\!::\!s$, then the +first part must be ``matched'' by a single copy of $r$. +Therefore we call recursively $der\,c\,r$ and ``append'' $r^*$ +in order to match the rest of $s$. -Another way to rationalise the definition of $der$ is to consider the -following operation on sets: +If this did not make sense, here is another way to rationalise +the definition of $der$ by considering the following operation +on sets: \[ Der\,c\,A\;\dn\;\{s\,|\,c\!::\!s \in A\} @@ -117,11 +303,11 @@ \noindent which essentially transforms a set of strings $A$ by filtering out all strings that do not start with $c$ and then strips off the $c$ from -all the remaining strings. For example suppose $A = \{"f\!oo", "bar", -"f\!rak"\}$ then +all the remaining strings. For example suppose $A = \{f\!oo, bar, +f\!rak\}$ then \[ -Der\,f\,A = \{"oo", "rak"\}\quad,\quad -Der\,b\,A = \{"ar"\} \quad \text{and} \quad +Der\,f\,A = \{oo, rak\}\quad,\quad +Der\,b\,A = \{ar\} \quad \text{and} \quad Der\,a\,A = \varnothing \] @@ -141,61 +327,126 @@ from the remaining strings---this is exactly the language that $der\,c\,r$ can match. -If we want to find out whether the string $"abc"$ is matched by the -regular expression $r$ then we can iteratively apply $Der$ as follows +If we want to find out whether the string $abc$ is matched by +the regular expression $r_1$ then we can iteratively apply $der$ +as follows + +\begin{center} +\begin{tabular}{rll} +Input: $r_1$, $abc$\medskip\\ +Step 1: & build derivative of $a$ and $r_1$ & $(r_2 = der\,a\,r_1)$\smallskip\\ +Step 2: & build derivative of $b$ and $r_2$ & $(r_3 = der\,b\,r_2)$\smallskip\\ +Step 3: & build derivative of $c$ and $r_3$ & $(r_4 = der\,b\,r_3)$\smallskip\\ +Step 4: & the string is exhausted; test & ($nullable(r_4)$)\\ + & whether $r_4$ can recognise the\\ + & empty string\smallskip\\ +Output: & result of the test $\Rightarrow true \,\text{or}\, \textit{false}$\\ +\end{tabular} +\end{center} -\begin{enumerate} -\item $Der\,a\,(L(r))$ -\item $Der\,b\,(Der\,a\,(L(r)))$ -\item $Der\,c\,(Der\,b\,(Der\,a\,(L(r))))$ -\end{enumerate} +\noindent Again the operation $Der$ might help to rationalise +this algorithm. We want to know whether $abc \in L(r_1)$. We +do not know yet. But lets assume it is. Then $Der\,a\,L(r_1)$ +builds the set where all the strings not starting with $a$ are +filtered out. Of the remaining strings, the $a$ is stripped +off. Then we continue with filtering out all strings not +starting with $b$ and stripping off the $b$ from the remaining +strings, that means we build $Der\,b\,(Der\,a\,(L(r_1)))$. +Finally we filter out all strings not starting with $c$ and +strip off $c$ from the remaining string. This is +$Der\,c\,(Der\,b\,(Der\,a\,(L(r))))$. Now if $abc$ was in the +original set ($L(r_1)$), then in $Der\,c\,(Der\,b\,(Der\,a\,(L(r))))$ +must be the empty string. If not then $abc$ was not in the +language we started with. -\noindent -In the last step we need to test whether the empty string is in the -set. Our matching algorithm will work similarly, just using regular -expression instead of sets. For this we need to lift the notion of -derivatives from characters to strings. This can be done using the -following function, taking a string and regular expression as input -and a regular expression as output. +Our matching algorithm using $der$ and $nullable$ works +similarly, just using regular expression instead of sets. For +this we need to extend the notion of derivatives from +characters to strings. This can be done using the following +function, taking a string and regular expression as input and +a regular expression as output. \begin{center} \begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {\hspace{-10mm}}l@ {}} - $der\!s\, []\, r$ & $\dn$ & $r$ & \\ - $der\!s\, (c\!::\!s)\, r$ & $\dn$ & $der\!s\,s\,(der\,c\,r)$ & \\ + $\textit{ders}\, []\, r$ & $\dn$ & $r$ & \\ + $\textit{ders}\, (c\!::\!s)\, r$ & $\dn$ & $\textit{ders}\,s\,(der\,c\,r)$ & \\ \end{tabular} \end{center} -\noindent -Having $ders$ in place, we can finally define our matching algorithm: +\noindent This function essentially iterates $der$ taking one +character at the time from the original string until it is +exhausted. Having $ders$ in place, we can finally define our +matching algorithm: \[ -match\,s\,r = nullable(ders\,s\,r) -\] - -\noindent -We claim that - -\[ -match\,s\,r\quad\text{if and only if}\quad s\in L(r) +matches\,s\,r = nullable(ders\,s\,r) \] \noindent -holds, which means our algorithm satisfies the specification. This -algorithm was introduced by Janus Brzozowski in 1964. Its main -attractions are simplicity and being fast, as well as being easily -extendable for other regular expressions such as $r^{\{n\}}$, $r^?$, -$\sim{}r$ and so on. +We can claim that + +\[ +matches\,s\,r\quad\text{if and only if}\quad s\in L(r) +\] + +\noindent holds, which means our algorithm satisfies the +specification. Of course we can claim many things\ldots +whether the claim holds any water is a different question, +which for example is the point of the Strand-2 Coursework. + +This algorithm was introduced by Janus Brzozowski in 1964. Its +main attractions are simplicity and being fast, as well as +being easily extendable for other regular expressions such as +$r^{\{n\}}$, $r^?$, $\sim{}r$ and so on (this is subject of +Strand-1 Coursework 1). \subsection*{The Matching Algorithm in Scala} - +Another attraction of the algorithm is that it can be easily +implemented in a functional programming language, like Scala. +Given the implementation of regular expressions in Scala given +in the first lecture and handout, the functions for +\pcode{matches} are shown in Figure~\ref{scala1}. \begin{figure}[p] -{\lstset{language=Scala}\texttt{\lstinputlisting{../progs/app5.scala}}} -{\lstset{language=Scala}\texttt{\lstinputlisting{../progs/app6.scala}}} -\caption{Scala implementation of the nullable and derivatives functions.} +\lstinputlisting{../progs/app5.scala} +\caption{Scala implementation of the nullable and + derivatives functions.\label{scala1}} \end{figure} +For running the algorithm with our favourite example, the evil +regular expression $a?^{\{n\}}a^{\{n\}}$, we need to implement +the optional regular expression and the exactly $n$-times +regular expression. This can be done with the translations + +\lstinputlisting[numbers=none]{../progs/app51.scala} + +\noindent Running the matcher with the example, we find it is +slightly worse then the matcher in Ruby and Python. +Ooops\ldots\medskip + +\noindent Analysing this failure a bit we notice that +for $a^{\{n\}}$ we generate quite big regular expressions: + +\begin{center} +\begin{tabular}{rl} +1: & $a$\\ +2: & $a\cdot a$\\ +3: & $a\cdot a\cdot a$\\ +& \ldots\\ +13: & $a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a$\\ +& \ldots\\ +20: +\end{tabular} +\end{center} + +\noindent Our algorithm traverses such regular expressions at +least once every time a derivative is calculated. So having +large regular expressions, will cause problems. This problem +is aggravated with $a?$ being represented as $a + \epsilon$. + + + \end{document} %%% Local Variables: