\documentclass{article}\usepackage{../style}\usepackage{../langs}\usepackage{../graphics}\usepackage{../data}\pgfplotsset{compat=1.11}\begin{document}\section*{Handout 2 (Regular Expression Matching)}This lecture is about implementing a more efficient regularexpression matcher (the plots on the right)---more efficientthan the matchers from regular expression libraries in Rubyand Python (the plots on the left). These plots show therunning time for the evil regular expression$a?^{\{n\}}a^{\{n\}}$ and string composed of $n$ \pcode{a}s.We will use this regular expression and strings as runningexample. To see the substantial differences in the two plotsbelow, note the different scales of the $x$-axes. \begin{center}\begin{tabular}{@{}cc@{}}\begin{tikzpicture}\begin{axis}[xlabel={\pcode{a}s},ylabel={time in secs}, enlargelimits=false, xtick={0,5,...,30}, xmax=33, ymax=35, ytick={0,5,...,30}, scaled ticks=false, axis lines=left, width=5cm, height=5cm, legend entries={Python,Ruby}, legend pos=north west, legend cell align=left]\addplot[blue,mark=*, mark options={fill=white}] table {re-python.data};\addplot[brown,mark=triangle*, mark options={fill=white}] table {re-ruby.data}; \end{axis}\end{tikzpicture}&\begin{tikzpicture} \begin{axis}[xlabel={\pcode{a}s},ylabel={time in secs}, enlargelimits=false, xtick={0,3000,...,12000}, xmax=12500, ymax=35, ytick={0,5,...,30}, scaled ticks=false, axis lines=left, width=6.5cm, height=5cm]\addplot[green,mark=square*,mark options={fill=white}] table {re2b.data};\addplot[black,mark=square*,mark options={fill=white}] table {re3.data};\end{axis}\end{tikzpicture}\end{tabular}\end{center}\medskip\noindent Having specified in the previous lecture whatproblem our regular expression matcher is supposed to solve,namely for any given regular expression $r$ and string $s$answer \textit{true} if and only if\[s \in L(r)\]\noindent we can look at an algorithm to solve this problem.Clearly we cannot use the function $L$ directly for this,because in general the set of strings $L$ returns is infinite(recall what $L(a^*)$ is). In such cases there is no way wecan implement an exhaustive test for whether a string ismember of this set or not. In contrast our matching algorithmwill operate on the regular expression $r$ and string $s$,only, which are both finite. Before we come to the matchingalgorithm, however, let us have a closer look at what it meanswhen two regular expressions are equivalent.\subsection*{Regular Expression Equivalences}We already defined in Handout 1 what it means for two regularexpressions to be equivalent, namely if their meaning is thesame language:\[r_1 \equiv r_2 \;\dn\; L(r_1) = L(r_2)\]\noindentIt is relatively easy to verify that some concrete equivalenceshold, for example\begin{center}\begin{tabular}{rcl}$(a + b) + c$ & $\equiv$ & $a + (b + c)$\\$a + a$ & $\equiv$ & $a$\\$a + b$ & $\equiv$ & $b + a$\\$(a \cdot b) \cdot c$ & $\equiv$ & $a \cdot (b \cdot c)$\\$c \cdot (a + b)$ & $\equiv$ & $(c \cdot a) + (c \cdot b)$\\\end{tabular}\end{center}\noindentbut also easy to verify that the following regular expressionsare \emph{not} equivalent\begin{center}\begin{tabular}{rcl}$a \cdot a$ & $\not\equiv$ & $a$\\$a + (b \cdot c)$ & $\not\equiv$ & $(a + b) \cdot (a + c)$\\\end{tabular}\end{center}\noindent I leave it to you to verify these equivalences andnon-equivalences. It is also interesting to look at somecorner cases involving $\epsilon$ and $\varnothing$:\begin{center}\begin{tabular}{rcl}$a \cdot \varnothing$ & $\not\equiv$ & $a$\\$a + \epsilon$ & $\not\equiv$ & $a$\\$\epsilon$ & $\equiv$ & $\varnothing^*$\\$\epsilon^*$ & $\equiv$ & $\epsilon$\\$\varnothing^*$ & $\not\equiv$ & $\varnothing$\\\end{tabular}\end{center}\noindent Again I leave it to you to make sure you agreewith these equivalences and non-equivalences. For our matching algorithm however the following sevenequivalences will play an important role:\begin{center}\begin{tabular}{rcl}$r + \varnothing$ & $\equiv$ & $r$\\$\varnothing + r$ & $\equiv$ & $r$\\$r \cdot \epsilon$ & $\equiv$ & $r$\\$\epsilon \cdot r$ & $\equiv$ & $r$\\$r \cdot \varnothing$ & $\equiv$ & $\varnothing$\\$\varnothing \cdot r$ & $\equiv$ & $\varnothing$\\$r + r$ & $\equiv$ & $r$\end{tabular}\end{center}\noindent which always hold no matter what the regularexpression $r$ looks like. The first two are easy to verifysince $L(\varnothing)$ is the empty set. The next two are alsoeasy to verify since $L(\epsilon) = \{[]\}$ and appending theempty string to every string of another set, leaves the setunchanged. Be careful to fully comprehend the fifth and sixthequivalence: if you concatenate two sets of strings and one isthe empty set, then the concatenation will also be the emptyset. To see this, check the definition of \pcode{_ @ _}. Thelast equivalence is again trivial.What will be important later on is that we can orient theseequivalences and read them from left to right. In this way wecan view them as \emph{simplification rules}. Consider for example the regular expression \begin{equation}(r_1 + \varnothing) \cdot \epsilon + ((\epsilon + r_2) + r_3) \cdot (r_4 \cdot \varnothing)\label{big}\end{equation}\noindent If we can find an equivalent regular expression thatis simpler (smaller for example), then this might potentiallymake our matching algorithm run faster. The reason is thatwhether a string $s$ is in $L(r)$ or in $L(r')$ with $r\equivr'$ will always give the same answer. In the example above youwill see that the regular expression is equivalent to $r_1$.You can verify this by iteratively applying the simplificationrules from above:\begin{center}\begin{tabular}{ll} & $(r_1 + \varnothing) \cdot \epsilon + ((\epsilon + r_2) + r_3) \cdot (\underline{r_4 \cdot \varnothing})$\smallskip\\$\equiv$ & $(r_1 + \varnothing) \cdot \epsilon + \underline{((\epsilon + r_2) + r_3) \cdot \varnothing}$\smallskip\\$\equiv$ & $\underline{(r_1 + \varnothing) \cdot \epsilon} + \varnothing$\smallskip\\$\equiv$ & $(\underline{r_1 + \varnothing}) + \varnothing$\smallskip\\$\equiv$ & $\underline{r_1 + \varnothing}$\smallskip\\$\equiv$ & $r_1$\\end{tabular}\end{center}\noindent In each step, I underlined where a simplificationrule is applied. Our matching algorithm in the next sectionwill often generate such ``useless'' $\epsilon$s and$\varnothing$s, therefore simplifying them away will make thealgorithm quite a bit faster.\subsection*{The Matching Algorithm}The algorithm we will define below consists of two parts. Oneis the function $nullable$ which takes a regular expression asargument and decides whether it can match the empty string(this means it returns a boolean in Scala). This can be easilydefined recursively as follows:\begin{center}\begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}}$nullable(\varnothing)$ & $\dn$ & $\textit{false}$\\$nullable(\epsilon)$ & $\dn$ & $true$\\$nullable(c)$ & $\dn$ & $\textit{false}$\\$nullable(r_1 + r_2)$ & $\dn$ & $nullable(r_1) \vee nullable(r_2)$\\ $nullable(r_1 \cdot r_2)$ & $\dn$ & $nullable(r_1) \wedge nullable(r_2)$\\$nullable(r^*)$ & $\dn$ & $true$ \\\end{tabular}\end{center}\noindent The idea behind this function is that the followingproperty holds:\[nullable(r) \;\;\text{if and only if}\;\; []\in L(r)\]\noindent Note on the left-hand side of the if-and-only-if wehave a function we can implement; on the right we have itsspecification (which we cannot implement in a programminglanguage).The other function of our matching algorithm calculates a\emph{derivative} of a regular expression. This is a functionwhich will take a regular expression, say $r$, and acharacter, say $c$, as argument and returns a new regularexpression. Be careful that the intuition behind this functionis not so easy to grasp on first reading. Essentially thisfunction solves the following problem: if $r$ can match astring of the form $c\!::\!s$, what does the regularexpression look like that can match just $s$? The definitionof this function is as follows:\begin{center}\begin{tabular}{l@ {\hspace{2mm}}c@ {\hspace{2mm}}l} $der\, c\, (\varnothing)$ & $\dn$ & $\varnothing$\\ $der\, c\, (\epsilon)$ & $\dn$ & $\varnothing$ \\ $der\, c\, (d)$ & $\dn$ & if $c = d$ then $\epsilon$ else $\varnothing$\\ $der\, c\, (r_1 + r_2)$ & $\dn$ & $der\, c\, r_1 + der\, c\, r_2$\\ $der\, c\, (r_1 \cdot r_2)$ & $\dn$ & if $nullable (r_1)$\\ & & then $(der\,c\,r_1) \cdot r_2 + der\, c\, r_2$\\ & & else $(der\, c\, r_1) \cdot r_2$\\ $der\, c\, (r^*)$ & $\dn$ & $(der\,c\,r) \cdot (r^*)$ \end{tabular}\end{center}\noindent The first two clauses can be rationalised asfollows: recall that $der$ should calculate a regularexpression so that given the ``input'' regular expression canmatch a string of the form $c\!::\!s$, we want a regularexpression for $s$. Since neither $\varnothing$ nor $\epsilon$can match a string of the form $c\!::\!s$, we return$\varnothing$. In the third case we have to make acase-distinction: In case the regular expression is $c$, thenclearly it can recognise a string of the form $c\!::\!s$, justthat $s$ is the empty string. Therefore we return the$\epsilon$-regular expression. In the other case we againreturn $\varnothing$ since no string of the $c\!::\!s$ can bematched. Next come the recursive cases, which are a bit moreinvolved. Fortunately, the $+$-case is still relativelystraightforward: all strings of the form $c\!::\!s$ are eithermatched by the regular expression $r_1$ or $r_2$. So we justhave to recursively call $der$ with these two regularexpressions and compose the results again with $+$. Makessense? The $\cdot$-case is more complicated: if $r_1\cdot r_2$matches a string of the form $c\!::\!s$, then the first partmust be matched by $r_1$. Consequently, it makes sense toconstruct the regular expression for $s$ by calling $der$ with$r_1$ and ``appending'' $r_2$. There is however one exceptionto this simple rule: if $r_1$ can match the empty string, thenall of $c\!::\!s$ is matched by $r_2$. So in case $r_1$ isnullable (that is can match the empty string) we have to allowthe choice $der\,c\,r_2$ for calculating the regularexpression that can match $s$. Therefore we have to add theregular expression $der\,c\,r_2$ in the result. The $*$-caseis again simple: if $r^*$ matches a string of the form$c\!::\!s$, then the first part must be ``matched'' by asingle copy of $r$. Therefore we call recursively $der\,c\,r$and ``append'' $r^*$ in order to match the rest of $s$.If this did not make sense, here is another way to rationalisethe definition of $der$ by considering the following operationon sets:\[Der\,c\,A\;\dn\;\{s\,|\,c\!::\!s \in A\}\]\noindent This operation essentially transforms a set ofstrings $A$ by filtering out all strings that do not startwith $c$ and then strips off the $c$ from all the remainingstrings. For example suppose $A = \{f\!oo, bar, f\!rak\}$ then\[ Der\,f\,A = \{oo, rak\}\quad,\quad Der\,b\,A = \{ar\} \quad\text{and} \quad Der\,a\,A = \varnothing \]\noindentNote that in the last case $Der$ is empty, because no string in $A$starts with $a$. With this operation we can state the followingproperty about $der$:\[L(der\,c\,r) = Der\,c\,(L(r))\]\noindentThis property clarifies what regular expression $der$ calculates,namely take the set of strings that $r$ can match (that is $L(r)$),filter out all strings not starting with $c$ and strip off the $c$from the remaining strings---this is exactly the language that$der\,c\,r$ can match.If we want to find out whether the string $abc$ is matched bythe regular expression $r_1$ then we can iteratively apply $der$as follows\begin{center}\begin{tabular}{rll}Input: $r_1$, $abc$\medskip\\Step 1: & build derivative of $a$ and $r_1$ & $(r_2 = der\,a\,r_1)$\smallskip\\Step 2: & build derivative of $b$ and $r_2$ & $(r_3 = der\,b\,r_2)$\smallskip\\Step 3: & build derivative of $c$ and $r_3$ & $(r_4 = der\,b\,r_3)$\smallskip\\Step 4: & the string is exhausted; test & ($nullable(r_4)$)\\ & whether $r_4$ can recognise the\\ & empty string\smallskip\\Output: & result of this test $\Rightarrow true \,\text{or}\, \textit{false}$\\ \end{tabular}\end{center}\noindent Again the operation $Der$ might help to rationalisethis algorithm. We want to know whether $abc \in L(r_1)$. Wedo not know yet---but let us assume it is. Then $Der\,a\,L(r_1)$builds the set where all the strings not starting with $a$ arefiltered out. Of the remaining strings, the $a$ is strippedoff. Then we continue with filtering out all strings notstarting with $b$ and stripping off the $b$ from the remainingstrings, that means we build $Der\,b\,(Der\,a\,(L(r_1)))$.Finally we filter out all strings not starting with $c$ andstrip off $c$ from the remaining string. This is$Der\,c\,(Der\,b\,(Der\,a\,(L(r))))$. Now if $abc$ was in the original set ($L(r_1)$), then in $Der\,c\,(Der\,b\,(Der\,a\,(L(r))))$ must be the empty string. If not, then $abc$ was not in the language we started with.Our matching algorithm using $der$ and $nullable$ workssimilarly, just using regular expression instead of sets. Forthis we need to extend the notion of derivatives from singlecharacters to strings. This can be done using the followingfunction, taking a string and regular expression as input anda regular expression as output.\begin{center}\begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {\hspace{-10mm}}l@ {}} $\textit{ders}\, []\, r$ & $\dn$ & $r$ & \\ $\textit{ders}\, (c\!::\!s)\, r$ & $\dn$ & $\textit{ders}\,s\,(der\,c\,r)$ & \\ \end{tabular}\end{center}\noindent This function iterates $der$ taking one character atthe time from the original string until it is exhausted.Having $ders$ in place, we can finally define our matchingalgorithm:\[matches\,s\,r \dn nullable(ders\,s\,r)\]\noindentand we can claim that\[matches\,s\,r\quad\text{if and only if}\quad s\in L(r)\]\noindent holds, which means our algorithm satisfies thespecification. Of course we can claim many things\ldotswhether the claim holds any water is a different question,which for example is the point of the Strand-2 Coursework.This algorithm was introduced by Janus Brzozowski in 1964. Itsmain attractions are simplicity and being fast, as well asbeing easily extendable for other regular expressions such as$r^{\{n\}}$, $r^?$, $\sim{}r$ and so on (this is subject ofStrand-1 Coursework 1).\subsection*{The Matching Algorithm in Scala}Another attraction of the algorithm is that it can be easilyimplemented in a functional programming language, like Scala.Given the implementation of regular expressions in Scala shownin the first lecture and handout, the functions and subfunctionsfor \pcode{matches} are shown in Figure~\ref{scala1}.\begin{figure}[p]\lstinputlisting{../progs/app5.scala}\caption{Scala implementation of the nullable and derivatives functions. These functions are easy to implement in functional languages, because pattern matching and recursion allow us to mimic the mathematical definitions very closely.\label{scala1}}\end{figure}For running the algorithm with our favourite example, the evilregular expression $a?^{\{n\}}a^{\{n\}}$, we need to implementthe optional regular expression and the exactly $n$-timesregular expression. This can be done with the translations\lstinputlisting[numbers=none]{../progs/app51.scala}\noindent Running the matcher with the example, we find it isslightly worse then the matcher in Ruby and Python.Ooops\ldots\begin{center}\begin{tikzpicture}\begin{axis}[ xlabel={\pcode{a}s}, ylabel={time in secs}, enlargelimits=false, xtick={0,5,...,30}, xmax=30, ytick={0,5,...,30}, scaled ticks=false, axis lines=left, width=6cm, height=5cm, legend entries={Python,Ruby,Scala V1}, legend pos=outer north east, legend cell align=left ]\addplot[blue,mark=*, mark options={fill=white}] table {re-python.data};\addplot[brown,mark=pentagon*, mark options={fill=white}] table {re-ruby.data}; \addplot[red,mark=triangle*,mark options={fill=white}] table {re1.data}; \end{axis}\end{tikzpicture}\end{center}\noindent Analysing this failure we notice that for$a^{\{n\}}$ we generate quite big regular expressions:\begin{center}\begin{tabular}{rl}1: & $a$\\2: & $a\cdot a$\\3: & $a\cdot a\cdot a$\\& \ldots\\13: & $a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a$\\& \ldots\end{tabular}\end{center}\noindent Our algorithm traverses such regular expressions atleast once every time a derivative is calculated. So havinglarge regular expressions will cause problems. This problemis aggravated by $a?$ being represented as $a + \epsilon$.We can however fix this by having an explicit constructor for$r^{\{n\}}$. In Scala we would introduce a constructor like\begin{center}\code{case class NTIMES(r: Rexp, n: Int) extends Rexp}\end{center}\noindent With this fix we have a constant ``size'' regularexpression for our running example no matter how large $n$ is.This means we have to also add cases for \pcode{NTIMES} in thefunctions $nullable$ and $der$. Does the change have anyeffect?\begin{center}\begin{tikzpicture}\begin{axis}[ xlabel={\pcode{a}s}, ylabel={time in secs}, enlargelimits=false, xtick={0,100,...,1000}, xmax=1000, ytick={0,5,...,30}, scaled ticks=false, axis lines=left, width=9.5cm, height=5cm, legend entries={Python,Ruby,Scala V1,Scala V2}, legend pos=outer north east, legend cell align=left ]\addplot[blue,mark=*, mark options={fill=white}] table {re-python.data};\addplot[brown,mark=pentagon*, mark options={fill=white}] table {re-ruby.data}; \addplot[red,mark=triangle*,mark options={fill=white}] table {re1.data}; \addplot[green,mark=square*,mark options={fill=white}] table {re2b.data};\end{axis}\end{tikzpicture}\end{center}\noindent Now we are talking business! The modified matcher can within 30 seconds handle regular expressions up to$n = 950$ before a StackOverflow is raised.The moral is that our algorithm is rather sensitive to thesize of regular expressions it needs to handle. This is ofcourse obvious because both $nullable$ and $der$ frequentlyneed to traverse the whole regular expression. There seems,however, one more issue for making the algorithm run faster.The derivative function often produces ``useless''$\varnothing$s and $\epsilon$s. To see this, consider $r = ((a\cdot b) + b)^*$ and the following two derivatives\begin{center}\begin{tabular}{l}$der\,a\,r = ((\epsilon \cdot b) + \varnothing) \cdot r$\\$der\,b\,r = ((\varnothing \cdot b) + \epsilon)\cdot r$\\$der\,c\,r = ((\varnothing \cdot b) + \varnothing)\cdot r$\end{tabular}\end{center}\noindent If we simplify them according to the simple rules from thebeginning, we can replace the right-hand sides by the smaller equivalent regular expressions\begin{center}\begin{tabular}{l}$der\,a\,r \equiv b \cdot r$\\$der\,b\,r \equiv r$\\$der\,c\,r \equiv \varnothing$\end{tabular}\end{center}\noindent I leave it to you to contemplate whether such asimplification can have any impact on the correctness of ouralgorithm (will it change any answers?). Figure~\ref{scala2}gives a simplification function that recursively traverses aregular expression and simplifies it according to the rulesgiven at the beginning. There are only rules for $+$, $\cdot$and $n$-times (the latter because we added it in the secondversion of our matcher). There is no rule for a star, becauseempirical data and also a little thought showed thatsimplifying under a star is a waste of computation time. Thesimplification function will be called after every derivation.This additional step removes all the ``junk'' the derivativefunction introduced. Does this improve the speed? You bet!!\begin{figure}[p]\lstinputlisting{../progs/app6.scala}\caption{The simplification function and modified \texttt{ders}-function; this function nowcalls \texttt{der} first, but then tries to simplifythe resulting derivative regular expressions.\label{scala2}}\end{figure}\begin{center}\begin{tikzpicture}\begin{axis}[xlabel={\pcode{a}s},ylabel={time in secs}, enlargelimits=false, xtick={0,2000,...,12000}, xmax=12000, ytick={0,5,...,30}, scaled ticks=false, axis lines=left, width=9cm, height=4cm, legend entries={Scala V2,Scala V3}]\addplot[green,mark=square*,mark options={fill=white}] table {re2b.data};\addplot[black,mark=square*,mark options={fill=white}] table {re3.data};\end{axis}\end{tikzpicture}\end{center}\end{document}%%% Local Variables: %%% mode: latex%%% TeX-master: t%%% End: