handouts/ho02.tex
changeset 261 24531cfaa36a
parent 259 e5f4b8ff23b8
child 262 ee4304bc6350
equal deleted inserted replaced
260:65d1ea0e989f 261:24531cfaa36a
     1 \documentclass{article}
     1 \documentclass{article}
     2 \usepackage{../style}
     2 \usepackage{../style}
     3 \usepackage{../langs}
     3 \usepackage{../langs}
     4 
     4 \usepackage{../graphics}
       
     5 \usepackage{../data}
     5 
     6 
     6 \begin{document}
     7 \begin{document}
     7 
     8 
     8 \section*{Handout 2}
     9 \section*{Handout 2}
     9 
    10 
    10 Having specified what problem our matching algorithm, \pcode{match},
    11 This lecture is about implementing a more efficient regular
    11 is supposed to solve, namely for a given regular expression $r$ and
    12 expression matcher (the plots on the right)---more efficient
    12 string $s$ answer \textit{true} if and only if
    13 than the matchers from regular expression libraries in Ruby and
       
    14 Python (the plots on the left). These plots show the running 
       
    15 time for the evil regular expression $a?^{\{n\}}a^{\{n\}}$.
       
    16 
       
    17 \begin{center}
       
    18 \begin{tabular}{@{}cc@{}}
       
    19 \begin{tikzpicture}[y=.072cm, x=.12cm]
       
    20  	%axis
       
    21 	\draw (0,0) -- coordinate (x axis mid) (30,0);
       
    22    \draw (0,0) -- coordinate (y axis mid) (0,30);
       
    23    %ticks
       
    24    \foreach \x in {0,5,...,30}
       
    25      \draw (\x,1pt) -- (\x,-3pt) node[anchor=north] {\x};
       
    26    \foreach \y in {0,5,...,30}
       
    27      \draw (1pt,\y) -- (-3pt,\y) node[anchor=east] {\y}; 
       
    28 	%labels      
       
    29 	\node[below=0.6cm] at (x axis mid) {number of \texttt{a}s};
       
    30 	\node[rotate=90,left=0.9cm] at (y axis mid) {time in secs};
       
    31 	%plots
       
    32 	\draw[color=blue] plot[mark=*] 
       
    33 		file {re-python.data};
       
    34 	\draw[color=brown] plot[mark=triangle*] 
       
    35 		file {re-ruby.data};	 
       
    36    %legend
       
    37 	\begin{scope}[shift={(4,20)}] 
       
    38 	\draw[color=blue] (0,0) -- 
       
    39 		plot[mark=*] (0.25,0) -- (0.5,0) 
       
    40 		node[right]{\small Python};
       
    41 	\draw[yshift=-4mm, color=brown] (0,0) -- 
       
    42 		plot[mark=triangle*] (0.25,0) -- (0.5,0)
       
    43 		node[right]{\small Ruby};
       
    44 	\end{scope}	
       
    45 \end{tikzpicture} 
       
    46 &
       
    47 \begin{tikzpicture}[y=.072cm, x=.0004cm]
       
    48  	%axis
       
    49 	\draw (0,0) -- coordinate (x axis mid) (12000,0);
       
    50    \draw (0,0) -- coordinate (y axis mid) (0,30);
       
    51    %ticks
       
    52    \foreach \x in {0,3000,...,12000}
       
    53      	\draw (\x,1pt) -- (\x,-3pt) node[anchor=north] {\x};
       
    54    \foreach \y in {0,5,...,30}
       
    55      	\draw (1pt,\y) -- (-3pt,\y) node[anchor=east] {\y}; 
       
    56 	%labels      
       
    57 	\node[below=0.6cm] at (x axis mid) {number of \texttt{a}s};
       
    58 	\node[rotate=90,left=0.9cm] at (y axis mid) {time in secs};
       
    59 
       
    60 	%plots
       
    61    \draw[color=green] plot[mark=square*, mark options={fill=white} ] 
       
    62 		file {re2b.data};
       
    63 	\draw[color=black] plot[mark=square*, mark options={fill=white} ] 
       
    64 		file {re3.data};	 
       
    65 \end{tikzpicture}
       
    66 \end{tabular}
       
    67 \end{center}\medskip
       
    68 
       
    69 
       
    70 \noindent Having specified in the previous lecture what
       
    71 problem our regular expression matcher, which we will call
       
    72 \pcode{matches}, is supposed to solve, namely for any given
       
    73 regular expression $r$ and string $s$ answer \textit{true} if
       
    74 and only if
    13 
    75 
    14 \[
    76 \[
    15 s \in L(r)
    77 s \in L(r)
    16 \]
    78 \]
    17 
    79 
    18 \noindent we can look at an algorithm to solve this problem.
    80 \noindent we can look at an algorithm to solve this problem.
    19 Clearly we cannot use the function $L$ directly for this,
    81 Clearly we cannot use the function $L$ directly for this,
    20 because in general the set of strings $L$ returns is infinite
    82 because in general the set of strings $L$ returns is infinite
    21 (recall what $L(a^*)$ is). In such cases there is no way we
    83 (recall what $L(a^*)$ is). In such cases there is no way we
    22 can implement an exhaustive test for whether a string is
    84 can implement an exhaustive test for whether a string is
    23 member of this set or not. Before we come to the matching 
    85 member of this set or not. In contrast our matching algorithm
    24 algorithm, lets have a closer look at what it means when
    86 will mainly operate on the regular expression $r$ and string
    25 two regular expressions are equivalent.
    87 $s$, which are both finite. Before we come to the matching
       
    88 algorithm, however, let us have a closer look at what it means
       
    89 when two regular expressions are equivalent.
    26 
    90 
    27 \subsection*{Regular Expression Equivalences}
    91 \subsection*{Regular Expression Equivalences}
    28 
    92 
    29 
    93 We already defined in Handout 1 what it means for two regular
    30 \subsection*{Matching Algorithm}
    94 expressions to be equivalent, namely if their meaning is the
    31 
    95 same language:
    32 The algorithm we will define below consists of two parts. One is the
    96 
    33 function $nullable$ which takes a regular expression as argument and
    97 \[
    34 decides whether it can match the empty string (this means it returns a
    98 r_1 \equiv r_2 \;\dn\; L(r_1) = L(r_2)
    35 boolean). This can be easily defined recursively as follows:
    99 \]
       
   100 
       
   101 \noindent
       
   102 It is relatively easy to verify that some concrete equivalences
       
   103 hold, for example
       
   104 
       
   105 \begin{center}
       
   106 \begin{tabular}{rcl}
       
   107 $(a + b)  + c$ & $\equiv$ & $a + (b + c)$\\
       
   108 $a + a$        & $\equiv$ & $a$\\
       
   109 $a + b$        & $\equiv$ & $b + a$\\
       
   110 $(a \cdot b) \cdot c$ & $\equiv$ & $a \cdot (b \cdot c)$\\
       
   111 $c \cdot (a + b)$ & $\equiv$ & $(c \cdot a) + (c \cdot b)$\\
       
   112 \end{tabular}
       
   113 \end{center}
       
   114 
       
   115 \noindent
       
   116 but also easy to verify that the following regular expressions
       
   117 are \emph{not} equivalent
       
   118 
       
   119 \begin{center}
       
   120 \begin{tabular}{rcl}
       
   121 $a \cdot a$ & $\not\equiv$ & $a$\\
       
   122 $a + (b \cdot c)$ & $\not\equiv$ & $(a + b) \cdot (a + c)$\\
       
   123 \end{tabular}
       
   124 \end{center}
       
   125 
       
   126 \noindent I leave it to you to verify these equivalences and
       
   127 non-equivalences. It is also interesting to look at some
       
   128 corner cases involving $\epsilon$ and $\varnothing$:
       
   129 
       
   130 \begin{center}
       
   131 \begin{tabular}{rcl}
       
   132 $a \cdot \varnothing$ & $\not\equiv$ & $a$\\
       
   133 $a + \epsilon$ & $\not\equiv$ & $a$\\
       
   134 $\epsilon$ & $\equiv$ & $\varnothing^*$\\
       
   135 $\epsilon^*$ & $\equiv$ & $\epsilon$\\
       
   136 $\varnothing^*$ & $\not\equiv$ & $\varnothing$\\
       
   137 \end{tabular}
       
   138 \end{center}
       
   139 
       
   140 \noindent Again I leave it to you to make sure you agree
       
   141 with these equivalences and non-equivalences. 
       
   142 
       
   143 
       
   144 For our matching algorithm however the following six
       
   145 equivalences will play an important role:
       
   146 
       
   147 \begin{center}
       
   148 \begin{tabular}{rcl}
       
   149 $r + \varnothing$  & $\equiv$ & $r$\\
       
   150 $\varnothing + r$  & $\equiv$ & $r$\\
       
   151 $r \cdot \epsilon$ & $\equiv$ & $r$\\
       
   152 $\epsilon \cdot r$     & $\equiv$ & $r$\\
       
   153 $r \cdot \varnothing$ & $\equiv$ & $\varnothing$\\
       
   154 $\varnothing \cdot r$ & $\equiv$ & $\varnothing$\\
       
   155 $r + r$ & $\equiv$ & $r$
       
   156 \end{tabular}
       
   157 \end{center}
       
   158 
       
   159 \noindent which always hold no matter what the regular
       
   160 expression $r$ looks like. The first are easy to verify since
       
   161 $L(\varnothing)$ is the empty set. The next two are also easy 
       
   162 to verify since $L(\epsilon) = \{[]\}$ and appending the empty 
       
   163 string to every string of another set, leaves the set 
       
   164 unchanged. Be careful to fully comprehend the fifth and 
       
   165 sixth equivalence: if you concatenate two sets of strings
       
   166 and one is the empty set, then the concatenation will also be
       
   167 the empty set. Check the definition of \pcode{_ @ _}.
       
   168 The last equivalence is again trivial.
       
   169 
       
   170 
       
   171 What will be important later on is that we can orient these
       
   172 equivalences and read them from left to right. In this way we
       
   173 can view them as \emph{simplification rules}. Suppose for 
       
   174 example the regular expression 
       
   175 
       
   176 \begin{equation}
       
   177 (r_1 + \varnothing) \cdot \epsilon + ((\epsilon + r_2) + r_3) \cdot (r_4 \cdot \varnothing)
       
   178 \label{big}
       
   179 \end{equation}
       
   180 
       
   181 \noindent If we can find an equivalent regular expression that
       
   182 is simpler (smaller for example), then this might potentially 
       
   183 make our matching algorithm is faster. The reason is that 
       
   184 whether a string $s$ is in $L(r)$ or in $L(r')$ with $r\equiv r'$
       
   185 will always give the same answer. In the example above you 
       
   186 will see that the regular expression is equivalent to $r_1$
       
   187 if you iteratively apply the simplification rules from above:
       
   188 
       
   189 \begin{center}
       
   190 \begin{tabular}{ll}
       
   191  & $(r_1 + \varnothing) \cdot \epsilon + ((\epsilon + r_2) + r_3) \cdot 
       
   192 (\underline{r_4 \cdot \varnothing})$\smallskip\\
       
   193 $\equiv$ & $(r_1 + \varnothing) \cdot \epsilon + \underline{((\epsilon + r_2) + r_3) \cdot 
       
   194 \varnothing}$\smallskip\\
       
   195 $\equiv$ & $\underline{(r_1 + \varnothing) \cdot \epsilon} + \varnothing$\smallskip\\
       
   196 $\equiv$ & $(\underline{r_1 + \varnothing}) + \varnothing$\smallskip\\
       
   197 $\equiv$ & $\underline{r_1 + \varnothing}$\smallskip\\
       
   198 $\equiv$ & $r_1$\
       
   199 \end{tabular}
       
   200 \end{center}
       
   201 
       
   202 \noindent In each step I underlined where a simplification
       
   203 rule is applied. Our matching algorithm in the next section
       
   204 will often generate such ``useless'' $\epsilon$s and
       
   205 $\varnothing$s, therefore simplifying them away will make the
       
   206 algorithm quite a bit faster.
       
   207 
       
   208 \subsection*{The Matching Algorithm}
       
   209 
       
   210 The algorithm we will define below consists of two parts. One
       
   211 is the function $nullable$ which takes a regular expression as
       
   212 argument and decides whether it can match the empty string
       
   213 (this means it returns a boolean in Scala). This can be easily
       
   214 defined recursively as follows:
    36 
   215 
    37 \begin{center}
   216 \begin{center}
    38 \begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}}
   217 \begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}}
    39 $nullable(\varnothing)$      & $\dn$ & $f\!\/alse$\\
   218 $nullable(\varnothing)$      & $\dn$ & $\textit{false}$\\
    40 $nullable(\epsilon)$           & $\dn$ &  $true$\\
   219 $nullable(\epsilon)$         & $\dn$ & $true$\\
    41 $nullable(c)$                    & $\dn$ &  $f\!alse$\\
   220 $nullable(c)$                & $\dn$ & $\textit{false}$\\
    42 $nullable(r_1 + r_2)$       & $\dn$ &  $nullable(r_1) \vee nullable(r_2)$\\ 
   221 $nullable(r_1 + r_2)$     & $\dn$ &  $nullable(r_1) \vee nullable(r_2)$\\ 
    43 $nullable(r_1 \cdot r_2)$ & $\dn$ &  $nullable(r_1) \wedge nullable(r_2)$\\
   222 $nullable(r_1 \cdot r_2)$ & $\dn$ &  $nullable(r_1) \wedge nullable(r_2)$\\
    44 $nullable(r^*)$                & $\dn$ & $true$ \\
   223 $nullable(r^*)$              & $\dn$ & $true$ \\
    45 \end{tabular}
   224 \end{tabular}
    46 \end{center}
   225 \end{center}
    47 
   226 
    48 \noindent
   227 \noindent The idea behind this function is that the following
    49 The idea behind this function is that the following property holds:
   228 property holds:
    50 
   229 
    51 \[
   230 \[
    52 nullable(r) \;\;\text{if and only if}\;\; ""\in L(r)
   231 nullable(r) \;\;\text{if and only if}\;\; []\in L(r)
    53 \]
   232 \]
    54 
   233 
    55 \noindent
   234 \noindent Note on the left-hand side we have a function we can
    56 Note on the left-hand side we have a function we can implement; on the
   235 implement; on the right we have its specification (which we
    57 right we have its specification.
   236 cannot implement in a programming language).
    58 
   237 
    59 The other function of our matching algorithm calculates a
   238 The other function of our matching algorithm calculates a
    60 \emph{derivative} of a regular expression. This is a function which
   239 \emph{derivative} of a regular expression. This is a function
    61 will take a regular expression, say $r$, and a character, say $c$, as
   240 which will take a regular expression, say $r$, and a
    62 argument and return a new regular expression. Be careful that the
   241 character, say $c$, as argument and return a new regular
    63 intuition behind this function is not so easy to grasp on first
   242 expression. Be careful that the intuition behind this function
    64 reading. Essentially this function solves the following problem: if
   243 is not so easy to grasp on first reading. Essentially this
    65 $r$ can match a string of the form $c\!::\!s$, what does the regular
   244 function solves the following problem: if $r$ can match a
    66 expression look like that can match just $s$. The definition of this
   245 string of the form $c\!::\!s$, what does the regular
    67 function is as follows:
   246 expression look like that can match just $s$. The definition
    68 
   247 of this function is as follows:
    69 \begin{center}
   248 
    70 \begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {\hspace{-10mm}}l@ {}}
   249 \begin{center}
    71   $der\, c\, (\varnothing)$      & $\dn$ & $\varnothing$ & \\
   250 \begin{tabular}{l@ {\hspace{2mm}}c@ {\hspace{2mm}}l}
    72   $der\, c\, (\epsilon)$           & $\dn$ & $\varnothing$ & \\
   251   $der\, c\, (\varnothing)$      & $\dn$ & $\varnothing$\\
    73   $der\, c\, (d)$                     & $\dn$ & if $c = d$ then $\epsilon$ else $\varnothing$ & \\
   252   $der\, c\, (\epsilon)$         & $\dn$ & $\varnothing$ \\
    74   $der\, c\, (r_1 + r_2)$        & $\dn$ & $der\, c\, r_1 + der\, c\, r_2$ & \\
   253   $der\, c\, (d)$                & $\dn$ & if $c = d$ then $\epsilon$ else $\varnothing$\\
       
   254   $der\, c\, (r_1 + r_2)$        & $\dn$ & $der\, c\, r_1 + der\, c\, r_2$\\
    75   $der\, c\, (r_1 \cdot r_2)$  & $\dn$  & if $nullable (r_1)$\\
   255   $der\, c\, (r_1 \cdot r_2)$  & $\dn$  & if $nullable (r_1)$\\
    76   & & then $(der\,c\,r_1) \cdot r_2 + der\, c\, r_2$\\ 
   256   & & then $(der\,c\,r_1) \cdot r_2 + der\, c\, r_2$\\ 
    77   & & else $(der\, c\, r_1) \cdot r_2$\\
   257   & & else $(der\, c\, r_1) \cdot r_2$\\
    78   $der\, c\, (r^*)$          & $\dn$ & $(der\,c\,r) \cdot (r^*)$ &
   258   $der\, c\, (r^*)$          & $\dn$ & $(der\,c\,r) \cdot (r^*)$
    79   \end{tabular}
   259   \end{tabular}
    80 \end{center}
   260 \end{center}
    81 
   261 
    82 \noindent
   262 \noindent The first two clauses can be rationalised as
    83 The first two clauses can be rationalised as follows: recall that
   263 follows: recall that $der$ should calculate a regular
    84 $der$ should calculate a regular expression, if the ``input'' regular
   264 expression, if the ``input'' regular expression can match a
    85 expression can match a string of the form $c\!::\!s$. Since neither
   265 string of the form $c\!::\!s$. Since neither $\varnothing$ nor
    86 $\varnothing$ nor $\epsilon$ can match such a string we return
   266 $\epsilon$ can match such a string we return $\varnothing$. In
    87 $\varnothing$. In the third case we have to make a case-distinction:
   267 the third case we have to make a case-distinction: In case the
    88 In case the regular expression is $c$, then clearly it can recognise a
   268 regular expression is $c$, then clearly it can recognise a
    89 string of the form $c\!::\!s$, just that $s$ is the empty
   269 string of the form $c\!::\!s$, just that $s$ is the empty
    90 string. Therefore we return the $\epsilon$-regular expression. In the
   270 string. Therefore we return the $\epsilon$-regular expression.
    91 other case we again return $\varnothing$ since no string of the
   271 In the other case we again return $\varnothing$ since no
    92 $c\!::\!s$ can be matched.  The $+$-case is relatively
   272 string of the $c\!::\!s$ can be matched. Next come the
    93 straightforward: all strings of the form $c\!::\!s$ are either matched
   273 recursive cases. Fortunately, the $+$-case is still relatively
    94 by the regular expression $r_1$ or $r_2$. So we just have to
   274 straightforward: all strings of the form $c\!::\!s$ are either
    95 recursively call $der$ with these two regular expressions and compose
   275 matched by the regular expression $r_1$ or $r_2$. So we just
    96 the results again with $+$. The $\cdot$-case is more complicated: if
   276 have to recursively call $der$ with these two regular
    97 $r_1\cdot r_2$ matches a string of the form $c\!::\!s$, then the first
   277 expressions and compose the results again with $+$. Yes, makes
    98 part must be matched by $r_1$.  Consequently, it makes sense to
   278 sense? The $\cdot$-case is more complicated: if $r_1\cdot r_2$
    99 construct the regular expression for $s$ by calling $der$ with $r_1$
   279 matches a string of the form $c\!::\!s$, then the first part
   100 and ``appending'' $r_2$. There is however one exception to this simple
   280 must be matched by $r_1$. Consequently, it makes sense to
   101 rule: if $r_1$ can match the empty string, then all of $c\!::\!s$ is
   281 construct the regular expression for $s$ by calling $der$ with
   102 matched by $r_2$. So in case $r_1$ is nullable (that is can match the
   282 $r_1$ and ``appending'' $r_2$. There is however one exception
   103 empty string) we have to allow the choice $der\,c\,r_2$ for
   283 to this simple rule: if $r_1$ can match the empty string, then
   104 calculating the regular expression that can match $s$. The $*$-case is
   284 all of $c\!::\!s$ is matched by $r_2$. So in case $r_1$ is
   105 again simple: if $r^*$ matches a string of the form $c\!::\!s$, then
   285 nullable (that is can match the empty string) we have to allow
   106 the first part must be ``matched'' by a single copy of $r$. Therefore
   286 the choice $der\,c\,r_2$ for calculating the regular
   107 we call recursively $der\,c\,r$ and ``append'' $r^*$ in order to match
   287 expression that can match $s$. Therefore we have to 
   108 the rest of $s$.
   288 add the regular expression $der\,c\,r_2$.
   109 
   289 The $*$-case is again simple:
   110 Another way to rationalise the definition of $der$ is to consider the
   290 if $r^*$ matches a string of the form $c\!::\!s$, then the
   111 following operation on sets:
   291 first part must be ``matched'' by a single copy of $r$.
       
   292 Therefore we call recursively $der\,c\,r$ and ``append'' $r^*$
       
   293 in order to match the rest of $s$.
       
   294 
       
   295 If this did not make sense, here is another way to rationalise
       
   296 the definition of $der$ by considering the following operation
       
   297 on sets:
   112 
   298 
   113 \[
   299 \[
   114 Der\,c\,A\;\dn\;\{s\,|\,c\!::\!s \in A\}
   300 Der\,c\,A\;\dn\;\{s\,|\,c\!::\!s \in A\}
   115 \]
   301 \]
   116 
   302 
   117 \noindent
   303 \noindent
   118 which essentially transforms a set of strings $A$ by filtering out all
   304 which essentially transforms a set of strings $A$ by filtering out all
   119 strings that do not start with $c$ and then strips off the $c$ from
   305 strings that do not start with $c$ and then strips off the $c$ from
   120 all the remaining strings. For example suppose $A = \{"f\!oo", "bar",
   306 all the remaining strings. For example suppose $A = \{f\!oo, bar,
   121 "f\!rak"\}$ then
   307 f\!rak\}$ then
   122 \[
   308 \[
   123 Der\,f\,A = \{"oo", "rak"\}\quad,\quad
   309 Der\,f\,A = \{oo, rak\}\quad,\quad
   124 Der\,b\,A = \{"ar"\}  \quad \text{and} \quad
   310 Der\,b\,A = \{ar\}  \quad \text{and} \quad
   125 Der\,a\,A = \varnothing
   311 Der\,a\,A = \varnothing
   126 \]
   312 \]
   127  
   313  
   128 \noindent
   314 \noindent
   129 Note that in the last case $Der$ is empty, because no string in $A$
   315 Note that in the last case $Der$ is empty, because no string in $A$
   139 namely take the set of strings that $r$ can match (that is $L(r)$),
   325 namely take the set of strings that $r$ can match (that is $L(r)$),
   140 filter out all strings not starting with $c$ and strip off the $c$
   326 filter out all strings not starting with $c$ and strip off the $c$
   141 from the remaining strings---this is exactly the language that
   327 from the remaining strings---this is exactly the language that
   142 $der\,c\,r$ can match.
   328 $der\,c\,r$ can match.
   143 
   329 
   144 If we want to find out whether the string $"abc"$ is matched by the
   330 If we want to find out whether the string $abc$ is matched by
   145 regular expression $r$ then we can iteratively apply $Der$ as follows
   331 the regular expression $r_1$ then we can iteratively apply $der$
   146 
   332 as follows
   147 \begin{enumerate}
   333 
   148 \item $Der\,a\,(L(r))$
   334 \begin{center}
   149 \item $Der\,b\,(Der\,a\,(L(r)))$
   335 \begin{tabular}{rll}
   150 \item $Der\,c\,(Der\,b\,(Der\,a\,(L(r))))$
   336 Input: $r_1$, $abc$\medskip\\
   151 \end{enumerate}
   337 Step 1: & build derivative of $a$ and $r_1$ & $(r_2 = der\,a\,r_1)$\smallskip\\
   152 
   338 Step 2: & build derivative of $b$ and $r_2$ & $(r_3 = der\,b\,r_2)$\smallskip\\
   153 \noindent
   339 Step 3: & build derivative of $c$ and $r_3$ & $(r_4 = der\,b\,r_3)$\smallskip\\
   154 In the last step we need to test whether the empty string is in the
   340 Step 4: & the string is exhausted; test & ($nullable(r_4)$)\\
   155 set. Our matching algorithm will work similarly, just using regular
   341         & whether $r_4$ can recognise the\\
   156 expression instead of sets. For this we need to lift the notion of
   342         & empty string\smallskip\\
   157 derivatives from characters to strings. This can be done using the
   343 Output: & result of the test $\Rightarrow true \,\text{or}\, \textit{false}$\\        
   158 following function, taking a string and regular expression as input
   344 \end{tabular}
   159 and a regular expression as output.
   345 \end{center}
       
   346 
       
   347 \noindent Again the operation $Der$ might help to rationalise
       
   348 this algorithm. We want to know whether $abc \in L(r_1)$. We
       
   349 do not know yet. But lets assume it is. Then $Der\,a\,L(r_1)$
       
   350 builds the set where all the strings not starting with $a$ are
       
   351 filtered out. Of the remaining strings, the $a$ is stripped
       
   352 off. Then we continue with filtering out all strings not
       
   353 starting with $b$ and stripping off the $b$ from the remaining
       
   354 strings, that means we build $Der\,b\,(Der\,a\,(L(r_1)))$.
       
   355 Finally we filter out all strings not starting with $c$ and
       
   356 strip off $c$ from the remaining string. This is
       
   357 $Der\,c\,(Der\,b\,(Der\,a\,(L(r))))$. Now if $abc$ was in the 
       
   358 original set ($L(r_1)$), then in $Der\,c\,(Der\,b\,(Der\,a\,(L(r))))$ 
       
   359 must be the empty string. If not then $abc$ was not in the 
       
   360 language we started with.
       
   361 
       
   362 Our matching algorithm using $der$ and $nullable$ works
       
   363 similarly, just using regular expression instead of sets. For
       
   364 this we need to extend the notion of derivatives from
       
   365 characters to strings. This can be done using the following
       
   366 function, taking a string and regular expression as input and
       
   367 a regular expression as output.
   160 
   368 
   161 \begin{center}
   369 \begin{center}
   162 \begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {\hspace{-10mm}}l@ {}}
   370 \begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {\hspace{-10mm}}l@ {}}
   163   $der\!s\, []\, r$     & $\dn$ & $r$ & \\
   371   $\textit{ders}\, []\, r$     & $\dn$ & $r$ & \\
   164   $der\!s\, (c\!::\!s)\, r$ & $\dn$ & $der\!s\,s\,(der\,c\,r)$ & \\
   372   $\textit{ders}\, (c\!::\!s)\, r$ & $\dn$ & $\textit{ders}\,s\,(der\,c\,r)$ & \\
   165   \end{tabular}
   373   \end{tabular}
   166 \end{center}
   374 \end{center}
   167 
   375 
   168 \noindent
   376 \noindent This function essentially iterates $der$ taking one
   169 Having $ders$ in place, we can finally define our matching algorithm:
   377 character at the time from the original string until it is
   170 
   378 exhausted. Having $ders$ in place, we can finally define our
   171 \[
   379 matching algorithm:
   172 match\,s\,r = nullable(ders\,s\,r)
   380 
   173 \]
   381 \[
   174 
   382 matches\,s\,r = nullable(ders\,s\,r)
   175 \noindent
   383 \]
   176 We claim that
   384 
   177 
   385 \noindent
   178 \[
   386 We can claim that
   179 match\,s\,r\quad\text{if and only if}\quad s\in L(r)
   387 
   180 \]
   388 \[
   181 
   389 matches\,s\,r\quad\text{if and only if}\quad s\in L(r)
   182 \noindent
   390 \]
   183 holds, which means our algorithm satisfies the specification. This
   391 
   184 algorithm was introduced by Janus Brzozowski in 1964. Its main
   392 \noindent holds, which means our algorithm satisfies the
   185 attractions are simplicity and being fast, as well as being easily
   393 specification. Of course we can claim many things\ldots
   186 extendable for other regular expressions such as $r^{\{n\}}$, $r^?$,
   394 whether the claim holds any water is a different question,
   187 $\sim{}r$ and so on.
   395 which for example is the point of the Strand-2 Coursework.
       
   396 
       
   397 This algorithm was introduced by Janus Brzozowski in 1964. Its
       
   398 main attractions are simplicity and being fast, as well as
       
   399 being easily extendable for other regular expressions such as
       
   400 $r^{\{n\}}$, $r^?$, $\sim{}r$ and so on (this is subject of
       
   401 Strand-1 Coursework 1).
   188 
   402 
   189 \subsection*{The Matching Algorithm in Scala}
   403 \subsection*{The Matching Algorithm in Scala}
   190 
   404 
   191 
   405 Another attraction of the algorithm is that it can be easily
       
   406 implemented in a functional programming language, like Scala.
       
   407 Given the implementation of regular expressions in Scala given
       
   408 in the first lecture and handout, the functions for
       
   409 \pcode{matches} are shown in Figure~\ref{scala1}.
   192 
   410 
   193 \begin{figure}[p]
   411 \begin{figure}[p]
   194 {\lstset{language=Scala}\texttt{\lstinputlisting{../progs/app5.scala}}}
   412 \lstinputlisting{../progs/app5.scala}
   195 {\lstset{language=Scala}\texttt{\lstinputlisting{../progs/app6.scala}}}
   413 \caption{Scala implementation of the nullable and 
   196 \caption{Scala implementation of the nullable and derivatives functions.}
   414   derivatives functions.\label{scala1}}
   197 \end{figure}
   415 \end{figure}
       
   416 
       
   417 For running the algorithm with our favourite example, the evil
       
   418 regular expression $a?^{\{n\}}a^{\{n\}}$, we need to implement
       
   419 the optional regular expression and the exactly $n$-times
       
   420 regular expression. This can be done with the translations
       
   421 
       
   422 \lstinputlisting[numbers=none]{../progs/app51.scala}
       
   423 
       
   424 \noindent Running the matcher with the example, we find it is
       
   425 slightly worse then the matcher in Ruby and Python.
       
   426 Ooops\ldots\medskip
       
   427 
       
   428 \noindent Analysing this failure a bit we notice that 
       
   429 for $a^{\{n\}}$ we generate quite big regular expressions:
       
   430 
       
   431 \begin{center}
       
   432 \begin{tabular}{rl}
       
   433 1: & $a$\\
       
   434 2: & $a\cdot a$\\
       
   435 3: & $a\cdot a\cdot a$\\
       
   436 & \ldots\\
       
   437 13: & $a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a$\\
       
   438 & \ldots\\
       
   439 20:
       
   440 \end{tabular}
       
   441 \end{center}
       
   442 
       
   443 \noindent Our algorithm traverses such regular expressions at
       
   444 least once every time a derivative is calculated. So having
       
   445 large regular expressions, will cause problems. This problem
       
   446 is aggravated with $a?$ being represented as $a + \epsilon$.
       
   447 
       
   448 
   198 
   449 
   199 \end{document}
   450 \end{document}
   200 
   451 
   201 %%% Local Variables: 
   452 %%% Local Variables: 
   202 %%% mode: latex
   453 %%% mode: latex