% !TEX program = xelatex\documentclass{article}\usepackage{../style}\usepackage{../langs}\usepackage{../graphics}\usepackage{../data}\begin{document}\fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017, 2018, 2019, 2020}\section*{Handout 2 (Regular Expression Matching)}%\noindent%{\bf Checklist}%%\begin{itemize}% \item You have understood the derivative-based matching algorithm.% \item You know how the derivative is related to the meaning of regular% expressions.% \item You can extend the algorithm to non-basic regular expressions.%\end{itemize}\bigskip\bigskip\bigskip\noindentThis lecture is about implementing a more efficient regular expressionmatcher (the plots on the right below)---more efficient than thematchers from regular expression libraries in Ruby, Python, JavaScriptand Java (the plots on the left). For this consider some experimentaldata: The first pair of plots shows the running time for theregular expression $(a^*)^*\cdot b$ and strings composed of $n$\pcode{a}s, like\[\underbrace{\pcode{a}\ldots\pcode{a}}_{n} \]\noindentThis means the regular expression actually does not match the strings.The second pair of plots shows the running time for the regularexpressions of the form $a^?{}^{\{n\}}\cdot a^{\{n\}}$ and correspondingstrings composed of $n$ \pcode{a}s (this time the regular expressionsmatch the strings). To see the substantial differences in the left andright plots below, note the different scales of the $x$-axes.\begin{center}Graphs: $(a^*)^* \cdot b$ and strings $\underbrace{a\ldots a}_{n}$\begin{tabular}{@{}cc@{}}\begin{tikzpicture}[baseline=(current bounding box.north)] \begin{axis}[ xlabel={$n$}, x label style={at={(1.05,0.0)}}, ylabel={time in secs}, enlargelimits=false, xtick={0,5,...,30}, xmax=33, ymax=35, ytick={0,5,...,30}, scaled ticks=false, axis lines=left, width=5cm, height=5cm, legend entries={Java 8, Python, JavaScript}, legend pos=north west, legend cell align=left]\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};\addplot[red,mark=*, mark options={fill=white}] table {re-js.data};\end{axis}\end{tikzpicture}&\begin{tikzpicture}[baseline=(current bounding box.north)] \begin{axis}[ xlabel={$n$}, x label style={at={(1.1,0.0)}}, %%xtick={0,1000000,...,5000000}, ylabel={time in secs}, enlargelimits=false, ymax=35, ytick={0,5,...,30}, axis lines=left, %scaled ticks=false, width=6.5cm, height=5cm, legend entries={Our matcher}, legend pos=north east, legend cell align=left]%\addplot[green,mark=square*,mark options={fill=white}] table {re2a.data}; \addplot[black,mark=square*,mark options={fill=white}] table {re3a.data};\end{axis}\end{tikzpicture}\end{tabular}\end{center}\bigskip\begin{center}Graphs: $a^{?\{n\}} \cdot a^{\{n\}}$ and strings $\underbrace{a\ldots a}_{n}$\\\begin{tabular}{@{}cc@{}}\begin{tikzpicture}\begin{axis}[ xlabel={$n$}, x label style={at={(1.05,0.0)}}, ylabel={\small time in secs}, enlargelimits=false, xtick={0,5,...,30}, xmax=33, ymax=35, ytick={0,5,...,30}, scaled ticks=false, axis lines=left, width=5cm, height=5cm, legend entries={Python,Ruby}, legend pos=north west, legend cell align=left]\addplot[blue,mark=*, mark options={fill=white}] table {re-python.data};\addplot[brown,mark=triangle*, mark options={fill=white}] table {re-ruby.data}; \end{axis}\end{tikzpicture}&\begin{tikzpicture} \begin{axis}[ xlabel={$n$}, x label style={at={(1.1,0.05)}}, ylabel={\small time in secs}, enlargelimits=false, xtick={0,2500,...,11000}, xmax=12000, ymax=35, ytick={0,5,...,30}, scaled ticks=false, axis lines=left, width=6.5cm, height=5cm, legend entries={Our matcher}, legend pos=north east, legend cell align=left]%\addplot[green,mark=square*,mark options={fill=white}] table {re2.data};\addplot[black,mark=square*,mark options={fill=white}] table {re3.data};\end{axis}\end{tikzpicture}\end{tabular}\end{center}\bigskip\noindentIn what follows we will use these regular expressions and strings asrunning examples. There will be several versions (V1, V2, V3,\ldots)of our matcher.\footnote{The corresponding files are \texttt{re1.scala}, \texttt{re2.scala} and so on. As usual, you can find the code on KEATS.}Having specified in the previous lecture whatproblem our regular expression matcher is supposed to solve,namely for any given regular expression $r$ and string $s$answer \textit{true} if and only if\[s \in L(r)\]\noindent we can look for an algorithm to solve this problem. Clearlywe cannot use the function $L$ directly for this, because in generalthe set of strings $L$ returns is infinite (recall what $L(a^*)$ is).In such cases there is no way we can implement an exhaustive test forwhether a string is member of this set or not. In contrast ourmatching algorithm will operate on the regular expression $r$ andstring $s$, only, which are both finite objects. Before we explain the matching algorithm, let us have a closer look at what itmeans when two regular expressions are equivalent.\subsection*{Regular Expression Equivalences}We already defined in Handout 1 what it means for two regularexpressions to be equivalent, namely whether their \emph{meaning} is the same language:\[r_1 \equiv r_2 \;\dn\; L(r_1) = L(r_2)\]\noindentIt is relatively easy to verify that some concrete equivalenceshold, for example\begin{center}\begin{tabular}{rcl}$(a + b) + c$ & $\equiv$ & $a + (b + c)$\\$a + a$ & $\equiv$ & $a$\\$a + b$ & $\equiv$ & $b + a$\\$(a \cdot b) \cdot c$ & $\equiv$ & $a \cdot (b \cdot c)$\\$c \cdot (a + b)$ & $\equiv$ & $(c \cdot a) + (c \cdot b)$\\\end{tabular}\end{center}\noindentbut also easy to verify that the following regular expressionsare \emph{not} equivalent\begin{center}\begin{tabular}{rcl}$a \cdot a$ & $\not\equiv$ & $a$\\$a + (b \cdot c)$ & $\not\equiv$ & $(a + b) \cdot (a + c)$\\\end{tabular}\end{center}\noindent I leave it to you to verify these equivalences andnon-equivalences. It is also interesting to look at somecorner cases involving $\ONE$ and $\ZERO$:\begin{center}\begin{tabular}{rcl}$a \cdot \ZERO$ & $\not\equiv$ & $a$\\$a + \ONE$ & $\not\equiv$ & $a$\\$\ONE$ & $\equiv$ & $\ZERO^*$\\$\ONE^*$ & $\equiv$ & $\ONE$\\$\ZERO^*$ & $\not\equiv$ & $\ZERO$\\\end{tabular}\end{center}\noindent Again I leave it to you to make sure you agreewith these equivalences and non-equivalences. For our matching algorithm however the following sevenequivalences will play an important role:\begin{center}\begin{tabular}{rcl}$r + \ZERO$ & $\equiv$ & $r$\\$\ZERO + r$ & $\equiv$ & $r$\\$r \cdot \ONE$ & $\equiv$ & $r$\\$\ONE \cdot r$ & $\equiv$ & $r$\\$r \cdot \ZERO$ & $\equiv$ & $\ZERO$\\$\ZERO \cdot r$ & $\equiv$ & $\ZERO$\\$r + r$ & $\equiv$ & $r$\end{tabular}\end{center}\noindent They always hold no matter what the regular expression $r$looks like. The first two are easy to verify since $L(\ZERO)$ is theempty set. The next two are also easy to verify since $L(\ONE) =\{[]\}$ and appending the empty string to every string of another set,leaves the set unchanged. Be careful to fully comprehend the fifth andsixth equivalence: if you concatenate two sets of strings and one isthe empty set, then the concatenation will also be the empty set. Tosee this, check the definition of $\_ @ \_$ for sets. The lastequivalence is again trivial.What will be critical later on is that we can orient theseequivalences and read them from left to right. In this way wecan view them as \emph{simplification rules}. Consider for example the regular expression \begin{equation}(r_1 + \ZERO) \cdot \ONE + ((\ONE + r_2) + r_3) \cdot (r_4 \cdot \ZERO)\label{big}\end{equation}\noindent If we can find an equivalent regular expression that issimpler (that usually means smaller), then this might potentially makeour matching algorithm run faster. We can look for such a simpler, butequivalent, regular expression $r'$ because whether a string $s$ is in$L(r)$ or in $L(r')$ does not matter as long as $r\equiv r'$. Yes?In the example above you will see that the regular expression in\eqref{big} is equivalent to just $r_1$. You can verify this byiteratively applying the simplification rules from above:\begin{center}\begin{tabular}{ll} & $(r_1 + \ZERO) \cdot \ONE + ((\ONE + r_2) + r_3) \cdot (\underline{r_4 \cdot \ZERO})$\smallskip\\$\equiv$ & $(r_1 + \ZERO) \cdot \ONE + \underline{((\ONE + r_2) + r_3) \cdot \ZERO}$\smallskip\\$\equiv$ & $\underline{(r_1 + \ZERO) \cdot \ONE} + \ZERO$\smallskip\\$\equiv$ & $(\underline{r_1 + \ZERO}) + \ZERO$\smallskip\\$\equiv$ & $\underline{r_1 + \ZERO}$\smallskip\\$\equiv$ & $r_1$\\end{tabular}\end{center}\noindent In each step, I underlined where a simplificationrule is applied. Our matching algorithm in the next sectionwill often generate such ``useless'' $\ONE$s and$\ZERO$s, therefore simplifying them away will make thealgorithm quite a bit faster.Finally here are three equivalences between regular expressions which arenot so obvious:\begin{center}\begin{tabular}{rcl}$r^*$ & $\equiv$ & $\ONE + r\cdot r^*$\\$(r_1 + r_2)^*$ & $\equiv$ & $r_1^* \cdot (r_2\cdot r_1^*)^*$\\$(r_1 \cdot r_2)^*$ & $\equiv$ & $\ONE + r_1\cdot (r_2 \cdot r_1)^* \cdot r_2$\\\end{tabular}\end{center}\noindentWe will not use them in our algorithm, but feel free to convinceyourself that they actually hold. As an aside, there has been a lot ofresearch about questions like: Can one always decide when two regularexpressions are equivalent or not? What does an algorithm look like todecide this efficiently? So in general it is not a trivial problem.\subsection*{The Matching Algorithm}The algorithm we will introduce below consists of two parts. One is thefunction $\textit{nullable}$ which takes a regular expression as anargument and decides whether it can match the empty string (this meansit returns a boolean in Scala). This can be easily defined recursivelyas follows:\begin{center}\begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}}$\textit{nullable}(\ZERO)$ & $\dn$ & $\textit{false}$\\$\textit{nullable}(\ONE)$ & $\dn$ & $\textit{true}$\\$\textit{nullable}(c)$ & $\dn$ & $\textit{false}$\\$\textit{nullable}(r_1 + r_2)$ & $\dn$ & $\textit{nullable}(r_1) \vee \textit{nullable}(r_2)$\\ $\textit{nullable}(r_1 \cdot r_2)$ & $\dn$ & $\textit{nullable}(r_1) \wedge \textit{nullable}(r_2)$\\$\textit{nullable}(r^*)$ & $\dn$ & $\textit{true}$ \\\end{tabular}\end{center}\noindent The idea behind this function is that the followingproperty holds:\[\textit{nullable}(r) \;\;\text{if and only if}\;\; []\in L(r)\]\noindent Note on the left-hand side of the if-and-only-if we have afunction we can implement, ofr example in Scala; on the right we haveits specification (which we cannot implement in a programming language).The other function of our matching algorithm calculates a\emph{derivative} of a regular expression. This is a functionwhich will take a regular expression, say $r$, and acharacter, say $c$, as arguments and returns a new regularexpression. Be mindful that the intuition behind this functionis not so easy to grasp on first reading. Essentially thisfunction solves the following problem: if $r$ can match astring of the form $c\!::\!s$, what does a regularexpression look like that can match just $s$? The definitionof this function is as follows:\begin{center}\begin{tabular}{l@ {\hspace{2mm}}c@ {\hspace{2mm}}l} $\textit{der}\, c\, (\ZERO)$ & $\dn$ & $\ZERO$\\ $\textit{der}\, c\, (\ONE)$ & $\dn$ & $\ZERO$ \\ $\textit{der}\, c\, (d)$ & $\dn$ & if $c = d$ then $\ONE$ else $\ZERO$\\ $\textit{der}\, c\, (r_1 + r_2)$ & $\dn$ & $\textit{der}\, c\, r_1 + \textit{der}\, c\, r_2$\\ $\textit{der}\, c\, (r_1 \cdot r_2)$ & $\dn$ & if $\textit{nullable} (r_1)$\\ & & then $(\textit{der}\,c\,r_1) \cdot r_2 + \textit{der}\, c\, r_2$\\ & & else $(\textit{der}\, c\, r_1) \cdot r_2$\\ $\textit{der}\, c\, (r^*)$ & $\dn$ & $(\textit{der}\,c\,r) \cdot (r^*)$ \end{tabular}\end{center}\noindent The first two clauses can be rationalised as follows: recallthat $\textit{der}$ should calculate a regular expression so thatprovided the ``input'' regular expression can match a string of theform $c\!::\!s$, we want a regular expression for $s$. Since neither$\ZERO$ nor $\ONE$ can match a string of the form $c\!::\!s$, we return$\ZERO$. In the third case we have to make a case-distinction: In casethe regular expression is $c$, then clearly it can recognise a string ofthe form $c\!::\!s$, just that $s$ is the empty string. Therefore wereturn the $\ONE$-regular expression, as it can match the empty string.In the other case we again return $\ZERO$ since no string of the$c\!::\!s$ can be matched. Next come the recursive cases, which are abit more involved. Fortunately, the $+$-case is still relativelystraightforward: all strings of the form $c\!::\!s$ are either matchedby the regular expression $r_1$ or $r_2$. So we just have to recursivelycall $\textit{der}$ with these two regular expressions and compose theresults again with $+$. Makes sense?The $\cdot$-case is more complicated: if $r_1\cdot r_2$matches a string of the form $c\!::\!s$, then the first partmust be matched by $r_1$. Consequently, it makes sense toconstruct the regular expression for $s$ by calling $\textit{der}$ with$r_1$ and ``appending'' $r_2$. There is however one exceptionto this simple rule: if $r_1$ can match the empty string, thenall of $c\!::\!s$ is matched by $r_2$. So in case $r_1$ isnullable (that is can match the empty string) we have to allowthe choice $\textit{der}\,c\,r_2$ for calculating the regularexpression that can match $s$. Therefore we have to add theregular expression $\textit{der}\,c\,r_2$ in the result. The $*$-caseis again simple: if $r^*$ matches a string of the form$c\!::\!s$, then the first part must be ``matched'' by asingle copy of $r$. Therefore we call recursively $\textit{der}\,c\,r$and ``append'' $r^*$ in order to match the rest of $s$. Stillmakes sense?If all this did not make sense yet, here is another way to explain thedefinition of $\textit{der}$ by considering the following operation onsets:\begin{equation}\label{Der}\textit{Der}\,c\,A\;\dn\;\{s\,|\,c\!::\!s \in A\}\end{equation}\noindent This operation essentially transforms a set ofstrings $A$ by filtering out all strings that do not startwith $c$ and then strips off the $c$ from all the remainingstrings. For example suppose $A = \{f\!oo, bar, f\!rak\}$ then\[ \textit{Der}\,f\,A = \{oo, rak\}\quad,\quad \textit{Der}\,b\,A = \{ar\} \quad \text{and} \quad \textit{Der}\,a\,A = \{\} \]\noindentNote that in the last case $\textit{Der}$ is empty, because no string in $A$starts with $a$. With this operation we can state the followingproperty about $\textit{der}$:\[L(\textit{der}\,c\,r) = \textit{Der}\,c\,(L(r))\]\noindentThis property clarifies what regular expression $\textit{der}$ calculates,namely take the set of strings that $r$ can match (that is $L(r)$),filter out all strings not starting with $c$ and strip off the $c$from the remaining strings---this is exactly the language that$\textit{der}\,c\,r$ can match.If we want to find out whether the string $abc$ is matched bythe regular expression $r_1$ then we can iteratively apply $\textit{der}$as follows\begin{center}\begin{tabular}{rll}Input: $r_1$, $abc$\medskip\\Step 1: & build derivative of $a$ and $r_1$ & $(r_2 = \textit{der}\,a\,r_1)$\smallskip\\Step 2: & build derivative of $b$ and $r_2$ & $(r_3 = \textit{der}\,b\,r_2)$\smallskip\\Step 3: & build derivative of $c$ and $r_3$ & $(r_4 = \textit{der}\,c\,r_3)$\smallskip\\Step 4: & the string is exhausted: & $(\textit{nullable}(r_4))$\\ & test whether $r_4$ can recognise the\\ & empty string\smallskip\\Output: & result of this test $\Rightarrow \textit{true} \,\text{or}\, \textit{false}$\\ \end{tabular}\end{center}\noindent Again the operation $\textit{Der}$ might help to rationalisethis algorithm. We want to know whether $abc \in L(r_1)$. Wedo not know yet---but let us assume it is. Then $\textit{Der}\,a\,L(r_1)$builds the set where all the strings not starting with $a$ arefiltered out. Of the remaining strings, the $a$ is strippedoff. So we should still have $bc$ in the set.Then we continue with filtering out all strings notstarting with $b$ and stripping off the $b$ from the remainingstrings, that means we build $\textit{Der}\,b\,(\textit{Der}\,a\,(L(r_1)))$.Finally we filter out all strings not starting with $c$ andstrip off $c$ from the remaining string. This is$\textit{Der}\,c\,(\textit{Der}\,b\,(\textit{Der}\,a\,(L(r_1))))$. Now if $abc$ was in the original set ($L(r_1)$), then $\textit{Der}\,c\,(\textit{Der}\,b\,(\textit{Der}\,a\,(L(r_1))))$ must contain the empty string. If not, then $abc$ was not in the language we started with.Our matching algorithm using $\textit{der}$ and $\textit{nullable}$ workssimilarly, just using regular expressions instead of sets. In order todefine our algorithm we need to extend the notion of derivatives from singlecharacters to strings. This can be done using the followingfunction, taking a string and a regular expression as input anda regular expression as output.\begin{center}\begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {\hspace{-10mm}}l@ {}} $\textit{ders}\, []\, r$ & $\dn$ & $r$ & \\ $\textit{ders}\, (c\!::\!s)\, r$ & $\dn$ & $\textit{ders}\,s\,(\textit{der}\,c\,r)$ & \\ \end{tabular}\end{center}\noindent This function iterates $\textit{der}$ taking one character atthe time from the original string until the string is exhausted.Having $\textit{der}s$ in place, we can finally define our matchingalgorithm:\[\textit{matcher}\,r\,s \dn \textit{nullable}(\textit{ders}\,s\,r)\]\noindentand we can claim that\[\textit{matcher}\,r\,s\quad\text{if and only if}\quad s\in L(r)\]\noindent holds, which means our algorithm satisfies thespecification. Of course we can claim many things\ldotswhether the claim holds any water is a different question,which for example is the point of the Strand-2 Coursework.This algorithm was introduced by Janusz Brzozowski in 1964, but is more widely known only in the last 10 or so years. Itsmain attractions are simplicity and being fast, as well asbeing easily extendible for other regular expressions such as$r^{\{n\}}$, $r^?$, $\sim{}r$ and so on (this is subject ofStrand-1 Coursework 1). \subsection*{The Matching Algorithm in Scala}Another attraction of the algorithm is that it can be easilyimplemented in a functional programming language, like Scala.Given the implementation of regular expressions in Scala shownin the first lecture and handout, the functions and subfunctionsfor \pcode{matcher} are shown in Figure~\ref{scala1}.\begin{figure}[p] \lstinputlisting[numbers=left,linebackgroundcolor= {\ifodd\value{lstnumber}\color{capri!3}\fi}] {../progs/app5.scala}\caption{A Scala implementation of \textit{nullable} and derivative function. These functions are easy to implement in functional programming languages. This is because pattern matching and recursion allow us to mimic the mathematical definitions very closely. Nearly all functional programming languages support pattern matching and recursion out of the box.\label{scala1}}\end{figure}%Remember our second example involving the regular expression%$(a^*)^* \cdot b$ which could not match strings of $n$ \texttt{a}s. %Java needed around 30 seconds to find this out a string with $n=28$.%It seems our algorithm is doing rather well in comparison:%%\begin{center}%\begin{tikzpicture}%\begin{axis}[% title={Graph: $(a^*)^* \cdot b$ and strings $\underbrace{a\ldots a}_{n}$},% xlabel={$n$},% x label style={at={(1.05,0.0)}},% ylabel={time in secs},% enlargelimits=false,% xtick={0,1000,...,6500},% xmax=6800,% ytick={0,5,...,30},% ymax=34,% scaled ticks=false,% axis lines=left,% width=8cm,% height=4.5cm, % legend entries={Java,Scala V1}, % legend pos=north east,% legend cell align=left]%\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};%\addplot[red,mark=triangle*,mark options={fill=white}] table {re1a.data};%\end{axis}%\end{tikzpicture}%\end{center}%%\noindent%This is not an error: it hardly takes more than half a second for%strings up to the length of 6500. After that we receive a%StackOverflow exception, but still\ldotsFor running the algorithm with our first example, the evilregular expression $a^?{}^{\{n\}}\cdot a^{\{n\}}$, we need to implementthe optional regular expression and the `exactly $n$-timesregular expression'. This can be done with the translations\lstinputlisting[numbers=none]{../progs/app51.scala}\noindent Running the matcher with this example, we find it isslightly worse then the matcher in Ruby and Python.Ooops\ldots\begin{center}\begin{tikzpicture}\begin{axis}[ title={Graph: $a^{?\{n\}} \cdot a^{\{n\}}$ and strings $\underbrace{a\ldots a}_{n}$}, xlabel={$n$}, x label style={at={(1.05,0.0)}}, ylabel={time in secs}, enlargelimits=false, xtick={0,5,...,30}, xmax=32, ytick={0,5,...,30}, scaled ticks=false, axis lines=left, width=6cm, height=5cm, legend entries={Python,Ruby,Scala V1}, legend pos=outer north east, legend cell align=left]\addplot[blue,mark=*, mark options={fill=white}] table {re-python.data};\addplot[brown,mark=pentagon*, mark options={fill=white}] table {re-ruby.data}; \addplot[red,mark=triangle*,mark options={fill=white}] table {re1.data}; \end{axis}\end{tikzpicture}\end{center}\noindent Analysing this failure we notice that for $a^{\{n\}}$, forexample, we generate quite big regular expressions:\begin{center}\begin{tabular}{rl}1: & $a$\\2: & $a\cdot a$\\3: & $a\cdot a\cdot a$\\& \ldots\\13: & $a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a$\\& \ldots\end{tabular}\end{center}\noindent Our algorithm traverses such regular expressions atleast once every time a derivative is calculated. So havinglarge regular expressions will cause problems. This problemis aggravated by $a^?$ being represented as $a + \ONE$.We can however fix this easily by having an explicit constructor for$r^{\{n\}}$. In Scala we would introduce a constructor like\begin{center}\code{case class NTIMES(r: Rexp, n: Int) extends Rexp}\end{center}\noindent With this fix we have a constant ``size'' regular expressionfor our running example no matter how large $n$ is (see the\texttt{size} section in the implementations). This means we have toalso add cases for \pcode{NTIMES} in the functions $\textit{nullable}$and $\textit{der}$. Does the change have any effect?\begin{center}\begin{tikzpicture}\begin{axis}[ title={Graph: $a^{?\{n\}} \cdot a^{\{n\}}$ and strings $\underbrace{a\ldots a}_{n}$}, xlabel={$n$}, x label style={at={(1.01,0.0)}}, ylabel={time in secs}, enlargelimits=false, xtick={0,200,...,1100}, xmax=1200, ytick={0,5,...,30}, scaled ticks=false, axis lines=left, width=10cm, height=5cm, legend entries={Python,Ruby,Scala V1,Scala V2}, legend pos=outer north east, legend cell align=left]\addplot[blue,mark=*, mark options={fill=white}] table {re-python.data};\addplot[brown,mark=pentagon*, mark options={fill=white}] table {re-ruby.data}; \addplot[red,mark=triangle*,mark options={fill=white}] table {re1.data}; \addplot[green,mark=square*,mark options={fill=white}] table {re2.data};\end{axis}\end{tikzpicture}\end{center}\noindent Now we are talking business! The modified matcher can within25 seconds handle regular expressions up to $n = 1,100$ before aStackOverflow is raised. Recall that Python and Ruby (and our firstversion, Scala V1) could only handle $n = 27$ or so in 30seconds. We have not tried our algorithm on the second example $(a^*)^* \cdotb$---I leave this to you.The moral is that our algorithm is rather sensitive to thesize of regular expressions it needs to handle. This is ofcourse obvious because both $\textit{nullable}$ and $\textit{der}$ frequentlyneed to traverse the whole regular expression. There seems,however, one more issue for making the algorithm run faster.The derivative function often produces ``useless''$\ZERO$s and $\ONE$s. To see this, consider $r = ((a\cdot b) + b)^*$ and the following three derivatives\begin{center}\begin{tabular}{l}$\textit{der}\,a\,r = ((\ONE \cdot b) + \ZERO) \cdot r$\\$\textit{der}\,b\,r = ((\ZERO \cdot b) + \ONE)\cdot r$\\$\textit{der}\,c\,r = ((\ZERO \cdot b) + \ZERO)\cdot r$\end{tabular}\end{center}\noindent If we simplify them according to the simplification rules from thebeginning, we can replace the right-hand sides by the smallerequivalent regular expressions\begin{center}\begin{tabular}{l}$\textit{der}\,a\,r \equiv b \cdot r$\\$\textit{der}\,b\,r \equiv r$\\$\textit{der}\,c\,r \equiv \ZERO$\end{tabular}\end{center}\noindent I leave it to you to contemplate whether such asimplification can have any impact on the correctness of our algorithm(will it change any answers?). Figure~\ref{scala2} gives asimplification function that recursively traverses a regularexpression and simplifies it according to the rules given at thebeginning. There are only rules for $+$ and $\cdot$. There is no simplification rule for a star, becauseempirical data and also a little thought showed that simplifying undera star is a waste of computation time. The simplification functionwill be called after every derivation. This additional step removesall the ``junk'' the derivative function introduced. Does this improvethe speed? You bet!!\begin{figure}[p]\lstinputlisting[numbers=left,linebackgroundcolor= {\ifodd\value{lstnumber}\color{capri!3}\fi}] {../progs/app6.scala}\caption{The simplification function and modified \texttt{ders}-function; this function nowcalls \texttt{der} first, but then simplifiesthe resulting derivative regular expressions beforebuilding the next derivative, seeLine~24.\label{scala2}}\end{figure}\begin{center}\begin{tikzpicture}\begin{axis}[ title={Graph: $a^{?\{n\}} \cdot a^{\{n\}}$ and strings $\underbrace{a\ldots a}_{n}$}, xlabel={$n$}, x label style={at={(1.04,0.0)}}, ylabel={time in secs}, enlargelimits=false, xtick={0,2500,...,10000}, xmax=12000, ytick={0,5,...,30}, ymax=32, scaled ticks=false, axis lines=left, width=9cm, height=5cm, legend entries={Scala V2,Scala V3}, legend pos=outer north east, legend cell align=left]\addplot[green,mark=square*,mark options={fill=white}] table {re2.data};\addplot[black,mark=square*,mark options={fill=white}] table {re3.data};\end{axis}\end{tikzpicture}\end{center}\noindentTo recap, Python and Ruby needed approximately 30 seconds to match astring of 28 \texttt{a}s and the regular expression $a^{?\{n\}} \cdota^{\{n\}}$. We need a third of this time to do the same with stringsup to 11,000 \texttt{a}s. Similarly, Java 8 and Python needed 30seconds to find out the regular expression $(a^*)^* \cdot b$ does notmatch the string of 28 \texttt{a}s. In Java 9 and later this has been cranked up to 39,000 \texttt{a}s, but we can do the same in the same amount of time for strings composed of nearly 6,000,000 \texttt{a}s. This is shown in the following plot.\begin{center}\begin{tikzpicture}\begin{axis}[ title={Graph: $(a^*)^* \cdot b$ and strings $\underbrace{a\ldots a}_{n}$}, xlabel={$n$}, ylabel={time in secs}, enlargelimits=false, ymax=35, ytick={0,5,...,30}, axis lines=left, %%scaled ticks=false, x label style={at={(1.09,0.0)}}, %%xmax=7700000, width=9cm, height=5cm, legend entries={Scala V3}, legend pos=outer north east, legend cell align=left]%\addplot[green,mark=square*,mark options={fill=white}] table {re2a.data};\addplot[black,mark=square*,mark options={fill=white}] table {re3a.data};\end{axis}\end{tikzpicture}\end{center}\subsection*{Epilogue}(23/Aug/2016) I found another place where this algorithm canbe sped up (this idea is not integrated with what is coming next, butI present it nonetheless). The idea is to not define \texttt{ders}that it iterates the derivative character-by-character, but in biggerchunks. The resulting code for \texttt{ders2} looks as follows:\lstinputlisting[numbers=none]{../progs/app52.scala} \noindentI have not fully understood why this version is much faster,but it seems it is a combination of the clauses for \texttt{ALT}and \texttt{SEQ}. In the latter case we call \texttt{der} with a single character and this potentially produces an alternative.The derivative of such an alternative can then be more efficiently calculated by \texttt{ders2} since it pushes a whole stringunder an \texttt{ALT}. The numbers are that in the second case $(a^*)^* \cdot b$ both versions are pretty much the same, but in the first case $a^{?\{n\}} \cdot a^{\{n\}}$ the improvement gives another factor of 100 speedup. Nice!\begin{center}\begin{tabular}{cc}\begin{tikzpicture}\begin{axis}[ title={Graph: $a^{?\{n\}} \cdot a^{\{n\}}$ and strings $\underbrace{a\ldots a}_{n}$}, xlabel={$n$}, x label style={at={(1.04,0.0)}}, ylabel={time in secs}, enlargelimits=false, xmax=7100000, ytick={0,5,...,30}, ymax=33, %scaled ticks=false, axis lines=left, width=5.3cm, height=5cm, legend entries={Scala V3, Scala V4}, legend style={at={(0.1,-0.2)},anchor=north}]\addplot[black,mark=square*,mark options={fill=white}] table {re3.data};\addplot[purple,mark=square*,mark options={fill=white}] table {re4.data};\end{axis}\end{tikzpicture}&\begin{tikzpicture}\begin{axis}[ title={Graph: $(a^*)^* \cdot b$ and strings $\underbrace{a\ldots a}_{n}$}, xlabel={$n$}, x label style={at={(1.09,0.0)}}, ylabel={time in secs}, enlargelimits=false, xmax=8200000, ytick={0,5,...,30}, ymax=33, %scaled ticks=false, axis lines=left, width=5.3cm, height=5cm, legend entries={Scala V3, Scala V4}, legend style={at={(0.1,-0.2)},anchor=north}]\addplot[black,mark=square*,mark options={fill=white}] table {re3a.data};\addplot[purple,mark=square*,mark options={fill=white}] table {re4a.data};\end{axis}\end{tikzpicture}\end{tabular}\end{center}\section*{Proofs}You might not like doing proofs. But they serve a veryimportant purpose in Computer Science: How can we be sure thatour algorithm matches its specification? We can try to testthe algorithm, but that often overlooks corner cases and anexhaustive testing is impossible (since there are infinitelymany inputs). Proofs allow us to ensure that an algorithmreally meets its specification. For the programs we look at in this module, the proofs willmostly by some form of induction. Remember that regularexpressions are defined as \begin{center}\begin{tabular}{r@{\hspace{1mm}}r@{\hspace{1mm}}l@{\hspace{13mm}}l} $r$ & $::=$ & $\ZERO$ & nothing\\ & $\mid$ & $\ONE$ & empty string / \texttt{""} / []\\ & $\mid$ & $c$ & single character\\ & $\mid$ & $r_1 + r_2$ & alternative / choice\\ & $\mid$ & $r_1 \cdot r_2$ & sequence\\ & $\mid$ & $r^*$ & star (zero or more)\\ \end{tabular}\end{center}\noindent If you want to show a property $P(r)$ for \emph{all} regular expressions $r$, then you have to follow essentially the recipe:\begin{itemize}\item $P$ has to hold for $\ZERO$, $\ONE$ and $c$ (these are the base cases).\item $P$ has to hold for $r_1 + r_2$ under the assumption that $P$ already holds for $r_1$ and $r_2$.\item $P$ has to hold for $r_1 \cdot r_2$ under the assumption that $P$ already holds for $r_1$ and $r_2$.\item $P$ has to hold for $r^*$ under the assumption that $P$ already holds for $r$.\end{itemize}\noindent A simple proof is for example showing the following property:\begin{equation}\textit{nullable}(r) \;\;\text{if and only if}\;\; []\in L(r)\label{nullableprop}\end{equation}\noindentLet us say that this property is $P(r)$, then the first casewe need to check is whether $P(\ZERO)$ (see recipe above). So we have to show that\[\textit{nullable}(\ZERO) \;\;\text{if and only if}\;\; []\in L(\ZERO)\]\noindent whereby $\textit{nullable}(\ZERO)$ is by definition ofthe function $\textit{nullable}$ always $\textit{false}$. We also havethat $L(\ZERO)$ is by definition $\{\}$. It isimpossible that the empty string $[]$ is in the empty set.Therefore also the right-hand side is false. Consequently weverified this case: both sides are false. We would still needto do this for $P(\ONE)$ and $P(c)$. I leave this toyou to verify.Next we need to check the inductive cases, for example$P(r_1 + r_2)$, which is\begin{equation}\textit{nullable}(r_1 + r_2) \;\;\text{if and only if}\;\; []\in L(r_1 + r_2)\label{propalt}\end{equation}\noindent The difference to the base cases is that in the inductivecases we can already assume we proved $P$ for the components, that iswe can assume.\begin{center}\begin{tabular}{l}$\textit{nullable}(r_1) \;\;\text{if and only if}\;\; []\in L(r_1)$ and\\$\textit{nullable}(r_2) \;\;\text{if and only if}\;\; []\in L(r_2)$\\\end{tabular}\end{center}\noindent These are called the induction hypotheses. To check this case, we can start from $\textit{nullable}(r_1 + r_2)$, which by definition of $\textit{nullable}$ is\[\textit{nullable}(r_1) \vee \textit{nullable}(r_2)\]\noindent Using the two induction hypotheses from above,we can transform this into \[[] \in L(r_1) \vee []\in(r_2)\]\noindent We just replaced the $\textit{nullable}(\ldots)$ parts bythe equivalent $[] \in L(\ldots)$ from the induction hypotheses. A bit of thinking convinces you that if$[] \in L(r_1) \vee []\in L(r_2)$ then the empty stringmust be in the union $L(r_1)\cup L(r_2)$, that is\[[] \in L(r_1)\cup L(r_2)\]\noindent but this is by definition of $L$ exactly $[] \in L(r_1 +r_2)$, which we needed to establish according to statement in\eqref{propalt}. What we have shown is that starting from$\textit{nullable}(r_1 + r_2)$ we have done equivalent transformationsto end up with $[] \in L(r_1 + r_2)$. Consequently we have establishedthat $P(r_1 + r_2)$ holds.In order to complete the proof we would now need to look at the cases \mbox{$P(r_1\cdot r_2)$} and $P(r^*)$. Again I let youcheck the details.You might also have to do induction proofs over strings. That means you want to establish a property $P(s)$ for allstrings $s$. For this remember strings are lists of characters. These lists can be either the empty list or alist of the form $c::s$. If you want to perform an inductionproof for strings you need to consider the cases\begin{itemize}\item $P$ has to hold for $[]$ (this is the base case).\item $P$ has to hold for $c::s$ under the assumption that $P$ already holds for $s$.\end{itemize}\noindentGiven this recipe, I let you show\begin{equation}\textit{Ders}\,s\,(L(r)) = L(\textit{ders}\,s\,r)\label{dersprop}\end{equation}\noindent by induction on $s$. Recall $\textit{Der}$ is defined for character---see \eqref{Der}; $\textit{Ders}$ is similar, but for strings:\[\textit{Ders}\,s\,A\;\dn\;\{s'\,|\,s @ s' \in A\}\]\noindent In this proof you can assume the following propertyfor $der$ and $\textit{Der}$ has already been proved, that is you canassume\[L(\textit{der}\,c\,r) = \textit{Der}\,c\,(L(r))\]\noindent holds (this would be of course another property that needsto be proved in a side-lemma by induction on $r$). This is a bitmore challenging, but not impossible.To sum up, using reasoning like the one shown above allows us to show the correctness of our algorithm. To see this,start from the specification\[s \in L(r)\]\noindent That is the problem we want to solve. Thinking a little, you will see that this problem is equivalent to the following problem\begin{equation}[] \in \textit{Ders}\,s\,(L(r))\label{dersstep}\end{equation}\noindent You agree? But we have shown above in \eqref{dersprop},that the $\textit{Ders}$ can be replaced by$L(\textit{ders}\ldots)$. That means \eqref{dersstep} is equivalent to\begin{equation}[] \in L(\textit{ders}\,s\,r)\label{prefinalstep}\end{equation}\noindent We have also shown that testing whether the emptystring is in a language is equivalent to the $\textit{nullable}$function; see \eqref{nullableprop}. That means\eqref{prefinalstep} is equivalent with\[\textit{nullable}(\textit{ders}\,s\,r)\] \noindent But this is just the definition of $matcher$\[matcher\,s\,r \dn nullable(\textit{ders}\,s\,r)\]\noindent In effect we have shown\[matcher\,s\,r\;\;\text{if and only if}\;\;s\in L(r)\]\noindent which is the property we set out to prove: our algorithmmeets its specification. To have done so, requires a few inductionproofs about strings and regular expressions. Following the \emph{inductionrecipes} is already a big step in actually performing these proofs.If you do not believe it, proofs have helped me to make sure my codeis correct and in several instances prevented me of letting slipembarrassing mistakes into the `wild'. \end{document}% !TeX program = latexmk -xelatex%%% Local Variables: %%% mode: latex%%% TeX-master: t%%% End: