\documentclass{article}\usepackage{charter}\usepackage{hyperref}\usepackage{amssymb}\usepackage{amsmath}\usepackage[T1]{fontenc}\usepackage{listings}\usepackage{xcolor}\newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}%\definecolor{javared}{rgb}{0.6,0,0} % for strings\definecolor{javagreen}{rgb}{0.25,0.5,0.35} % comments\definecolor{javapurple}{rgb}{0.5,0,0.35} % keywords\definecolor{javadocblue}{rgb}{0.25,0.35,0.75} % javadoc\lstdefinelanguage{scala}{ morekeywords={abstract,case,catch,class,def,% do,else,extends,false,final,finally,% for,if,implicit,import,match,mixin,% new,null,object,override,package,% private,protected,requires,return,sealed,% super,this,throw,trait,true,try,% type,val,var,while,with,yield}, otherkeywords={=>,<-,<\%,<:,>:,\#,@}, sensitive=true, morecomment=[l]{//}, morecomment=[n]{/*}{*/}, morestring=[b]", morestring=[b]', morestring=[b]"""}\lstset{language=Scala, basicstyle=\ttfamily, keywordstyle=\color{javapurple}\bfseries, stringstyle=\color{javagreen}, commentstyle=\color{javagreen}, morecomment=[s][\color{javadocblue}]{/**}{*/}, numbers=left, numberstyle=\tiny\color{black}, stepnumber=1, numbersep=10pt, tabsize=2, showspaces=false, showstringspaces=false}\begin{document}\section*{Handout 2}Having specified what problem our matching algorithm, $match$, is supposed to solve, namelyfor a given regular expression $r$ and string $s$ answer $true$ if and only if\[s \in L(r)\]\noindentwe can look at an algorithm to solve this problem.Clearly we cannot use the function $L$ directly for this, because in generalthe set of strings $L$ returns is infinite (recall what $L(a^*)$ is). In such cases there is no way we can implementan exhaustive test for whether a string is member of this set or not.The algorithm we will define below consists of two parts. One is the function $nullable$ which takes aregular expression as argument and decides whether it can match the empty string (this means it returns a boolean). This can be easily defined recursively as follows:\begin{center}\begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}}$nullable(\varnothing)$ & $\dn$ & $f\!\/alse$\\$nullable(\epsilon)$ & $\dn$ & $true$\\$nullable (c)$ & $\dn$ & $f\!alse$\\$nullable (r_1 + r_2)$ & $\dn$ & $nullable(r_1) \vee nullable(r_2)$\\ $nullable (r_1 \cdot r_2)$ & $\dn$ & $nullable(r_1) \wedge nullable(r_2)$\\$nullable (r^*)$ & $\dn$ & $true$ \\\end{tabular}\end{center}\noindentThe idea behind this function is that the following property holds:\[nullable(r) \;\;\text{if and only if}\;\; ""\in L(r)\]\noindentNote on the left-hand side we have a function we can implement; on the right we have its specification. The other function of our matching algorithm calculates a \emph{derivative} of a regular expression. This is a functionwhich will take a regular expression, say $r$, and a character, say $c$, as argument and return a new regular expression. Be careful that the intuition behind this function is not so easy to grasp on firstreading. Essentially this function solves the following problem: if $r$ can match a string of the form$c\!::\!s$, what does the regular expression look like that can match just $s$. The definition of thisfunction is as follows:\begin{center}\begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {\hspace{-10mm}}l@ {}} $der\, c\, (\varnothing)$ & $\dn$ & $\varnothing$ & \\ $der\, c\, (\epsilon)$ & $\dn$ & $\varnothing$ & \\ $der\, c\, (d)$ & $\dn$ & if $c = d$ then $\epsilon$ else $\varnothing$ & \\ $der\, c\, (r_1 + r_2)$ & $\dn$ & $der\, c\, r_1 + der\, c\, r_2$ & \\ $der\, c\, (r_1 \cdot r_2)$ & $\dn$ & if $nullable (r_1)$\\ & & then $(der\,c\,r_1) \cdot r_2 + der\, c\, r_2$\\ & & else $(der\, c\, r_1) \cdot r_2$\\ $der\, c\, (r^*)$ & $\dn$ & $(der\,c\,r) \cdot (r^*)$ & \end{tabular}\end{center}\noindentThe first two clauses can be rationalised as follows: recall that $der$ should calculate a regularexpression, if the ``input'' regular expression can match a string of the form $c\!::\!s$. Since neither$\varnothing$ nor $\epsilon$ can match such a string we return $\varnothing$. In the third casewe have to make a case-distinction: In case the regular expression is $c$, then clearly it can recognisea string of the form $c\!::\!s$, just that $s$ is the empty string. Therefore we return the $\epsilon$-regular expression. In the other case we again return $\varnothing$ since no string of the $c\!::\!s$ can be matched.The $+$-case is relatively straightforward: all strings of the form $c\!::\!s$ are either matched by theregular expression $r_1$ or $r_2$. So we just have to recursively call $der$ with these two regularexpressions and compose the results again with $+$. The $\cdot$-case is more complicated:if $r_1\cdot r_2$ matches a string of the form $c\!::\!s$, then the first part must be matched by $r_1$.Consequently, it makes sense to construct the regular expression for $s$ by calling $der$ with $r_1$ and``appending'' $r_2$. There is however one exception to this simple rule: if $r_1$ can match the emptystring, then all of $c\!::\!s$ is matched by $r_2$. So in case $r_1$ is nullable (that is can match theempty string) we have to allow the choice $der\,c\,r_2$ for calculating the regular expression that can match $s$. The $*$-case is again simple: if $r^*$ matches a string of the form $c\!::\!s$, then the first part must be``matched'' by a single copy of $r$. Therefore we call recursively $der\,c\,r$ and ``append'' $r^*$ in order to match the rest of $s$.Another way to rationalise the definition of $der$ is to consider the following operation on sets:\[Der\,c\,A\;\dn\;\{s\,|\,c\!::\!s \in A\}\]\noindentwhich essentially transforms a set of strings $A$ by filtering out all strings that do not start with $c$ and thenstrips off the $c$ from all the remaining strings. For example suppose $A = \{"f\!oo", "bar", "f\!rak"\}$ then\[Der\,f\,A = \{"oo", "rak"\}\quad,\quadDer\,b\,A = \{"ar"\} \quad \text{and} \quadDer\,a\,A = \varnothing\]\noindentNote that in the last case $Der$ is empty, because no string in $A$ starts with $a$. With this operation we canstate the following property about $der$:\[L(der\,c\,r) = Der\,c\,(L(r))\]\noindentThis property clarifies what regular expression $der$ calculates, namely take the set of stringsthat $r$ can match (that is $L(r)$), filter out all strings not starting with $c$ and strip off the $c$ from theremaining strings---this is exactly the language that $der\,c\,r$ can match.If we want to find out whether the string $"abc"$ is matched by the regular expression $r$then we can iteratively apply $Der$ as follows\begin{enumerate}\item $Der\,a\,(L(r))$\item $Der\,b\,(Der\,a\,(L(r)))$\item $Der\,c\,(Der\,b\,(Der\,a\,(L(r))))$\end{enumerate}\noindentIn the last step we need to test whether the empty string is in the set. Our matching algorithm will work similarly, just using regular expression instead of sets. For this we need to lift the notion of derivatives from characters to strings. This can bedone using the following function, taking a string and regular expression as input and a regular expression as output.\begin{center}\begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {\hspace{-10mm}}l@ {}} $der\!s\, []\, r$ & $\dn$ & $r$ & \\ $der\!s\, (c\!::\!s)\, r$ & $\dn$ & $der\!s\,s\,(der\,c\,r)$ & \\ \end{tabular}\end{center}\noindentHaving $ders$ in place, we can finally define our matching algorithm:\[match\,s\,r = nullable(ders\,s\,r)\]\noindentWe claim that\[match\,s\,r\quad\text{if and only if}\quad s\in L(r)\]\noindentholds, which means our algorithm satisfies the specification. This algorithm was introduced byJanus Brzozowski in 1964. Its main attractions are simplicity and being fast, as well asbeing easily extendable for other regular expressions such as $r^{\{n\}}$, $r^?$, $\sim{}r$ and so on.\begin{figure}[p]{\lstset{language=Scala}\texttt{\lstinputlisting{../progs/app5.scala}}}{\lstset{language=Scala}\texttt{\lstinputlisting{../progs/app6.scala}}}\caption{Scala implementation of the nullable and derivatives functions.}\end{figure}\end{document}%%% Local Variables: %%% mode: latex%%% TeX-master: t%%% End: