handouts/ho02.tex
changeset 412 1cef3924f7a2
parent 399 5c1fbb39c93e
child 414 065ca01b62ae
equal deleted inserted replaced
411:1aec0e1fda86 412:1cef3924f7a2
     9 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016}
     9 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016}
    10 
    10 
    11 
    11 
    12 \section*{Handout 2 (Regular Expression Matching)}
    12 \section*{Handout 2 (Regular Expression Matching)}
    13 
    13 
    14 This lecture is about implementing a more efficient regular
    14 This lecture is about implementing a more efficient regular expression
    15 expression matcher (the plots on the right)---more efficient
    15 matcher (the plots on the right)---more efficient than the matchers
    16 than the matchers from regular expression libraries in Ruby
    16 from regular expression libraries in Ruby, Python and Java (the plots
    17 and Python (the plots on the left). These plots show the
    17 on the left). The first pair of plots show the running time for the
    18 running time for the evil regular expression
    18 regular expressions $a^?{}^{\{n\}}\cdot a^{\{n\}}$ and strings composed
    19 $a^?{}^{\{n\}}\cdot a^{\{n\}}$ and strings composed of $n$
    19 of $n$ \pcode{a}s. The second pair of plots show the running time
    20 \pcode{a}s. We will use this regular expression and strings as
    20 for the regular expression $(a^*)^*\cdot b$ and also strings composed
    21 running example. To see the substantial differences in the two
    21 of $n$ \pcode{a}s (meaning this regular expression actually does not
    22 plots below, note the different scales of the $x$-axes. 
    22 match the strings).  To see the substantial differences in the left
       
    23 and right plots below, note the different scales of the $x$-axes. 
    23 
    24 
    24 
    25 
    25 \begin{center}
    26 \begin{center}
    26 \begin{tabular}{@{}cc@{}}
    27 \begin{tabular}{@{}cc@{}}
    27 \begin{tikzpicture}
    28 \begin{tikzpicture}
    28 \begin{axis}[
    29 \begin{axis}[
    29     xlabel={strings of {\tt a}s},
    30     xlabel={\small $a^{?\{n\}} \cdot a^{\{n\}}$ and strings $\underbrace{a\ldots a}_{n}$},
    30     ylabel={time in secs},
    31     ylabel={\small time in secs},
    31     enlargelimits=false,
    32     enlargelimits=false,
    32     xtick={0,5,...,30},
    33     xtick={0,5,...,30},
    33     xmax=33,
    34     xmax=33,
    34     ymax=35,
    35     ymax=35,
    35     ytick={0,5,...,30},
    36     ytick={0,5,...,30},
    47 \end{axis}
    48 \end{axis}
    48 \end{tikzpicture}
    49 \end{tikzpicture}
    49 &
    50 &
    50 \begin{tikzpicture}
    51 \begin{tikzpicture}
    51   \begin{axis}[
    52   \begin{axis}[
    52     xlabel={strings of \texttt{a}s},
    53     xlabel={\small $a^{?\{n\}} \cdot a^{\{n\}}$ and strings $\underbrace{a\ldots a}_{n}$},
       
    54     ylabel={\small time in secs},
       
    55     enlargelimits=false,
       
    56     xtick={0,3000,...,12000},
       
    57     xmax=12500,
       
    58     ymax=35,
       
    59     ytick={0,5,...,30},
       
    60     scaled ticks=false,
       
    61     axis lines=left,
       
    62     width=6.5cm,
       
    63     height=5cm]
       
    64 \addplot[green,mark=square*,mark options={fill=white}] table {re2b.data};
       
    65 \addplot[black,mark=square*,mark options={fill=white}] table {re3.data};
       
    66 \end{axis}
       
    67 \end{tikzpicture}
       
    68 \end{tabular}
       
    69 \end{center}
       
    70 
       
    71 \begin{center}
       
    72 \begin{tabular}{@{}cc@{}}
       
    73 \begin{tikzpicture}
       
    74 \begin{axis}[
       
    75     xlabel={$(a^*)^* \cdot b$ and strings $\underbrace{a\ldots a}_{n}$},
       
    76     ylabel={time in secs},
       
    77     enlargelimits=false,
       
    78     xtick={0,5,...,30},
       
    79     xmax=33,
       
    80     ymax=35,
       
    81     ytick={0,5,...,30},
       
    82     scaled ticks=false,
       
    83     axis lines=left,
       
    84     width=5cm,
       
    85     height=5cm, 
       
    86     legend entries={Java},  
       
    87     legend pos=north west,
       
    88     legend cell align=left]
       
    89 \addplot[blue,mark=*, mark options={fill=white}] table {re-java.data};
       
    90 \end{axis}
       
    91 \end{tikzpicture}
       
    92 &
       
    93 \begin{tikzpicture}
       
    94   \begin{axis}[
       
    95     xlabel={$(a^*)^* \cdot b$ and strings $\underbrace{a\ldots a}_{n}$}, 
    53     ylabel={time in secs},
    96     ylabel={time in secs},
    54     enlargelimits=false,
    97     enlargelimits=false,
    55     xtick={0,3000,...,12000},
    98     xtick={0,3000,...,12000},
    56     xmax=12500,
    99     xmax=12500,
    57     ymax=35,
   100     ymax=35,
    65 \end{axis}
   108 \end{axis}
    66 \end{tikzpicture}
   109 \end{tikzpicture}
    67 \end{tabular}
   110 \end{tabular}
    68 \end{center}\medskip
   111 \end{center}\medskip
    69 
   112 
    70 
   113 \noindent
    71 \noindent Having specified in the previous lecture what
   114 We will use these regular expressions and strings
       
   115 as running examples.
       
   116 
       
   117 Having specified in the previous lecture what
    72 problem our regular expression matcher is supposed to solve,
   118 problem our regular expression matcher is supposed to solve,
    73 namely for any given regular expression $r$ and string $s$
   119 namely for any given regular expression $r$ and string $s$
    74 answer \textit{true} if and only if
   120 answer \textit{true} if and only if
    75 
   121 
    76 \[
   122 \[
    77 s \in L(r)
   123 s \in L(r)
    78 \]
   124 \]
    79 
   125 
    80 \noindent we can look at an algorithm to solve this problem.
   126 \noindent we can look at an algorithm to solve this problem. Clearly
    81 Clearly we cannot use the function $L$ directly for this,
   127 we cannot use the function $L$ directly for this, because in general
    82 because in general the set of strings $L$ returns is infinite
   128 the set of strings $L$ returns is infinite (recall what $L(a^*)$ is).
    83 (recall what $L(a^*)$ is). In such cases there is no way we
   129 In such cases there is no way we can implement an exhaustive test for
    84 can implement an exhaustive test for whether a string is
   130 whether a string is member of this set or not. In contrast our
    85 member of this set or not. In contrast our matching algorithm
   131 matching algorithm will operate on the regular expression $r$ and
    86 will operate on the regular expression $r$ and string $s$,
   132 string $s$, only, which are both finite objects. Before we explain to
    87 only, which are both finite objects. Before we come to the matching
   133 the matching algorithm, however, let us have a closer look at what it
    88 algorithm, however, let us have a closer look at what it means
   134 means when two regular expressions are equivalent.
    89 when two regular expressions are equivalent.
       
    90 
   135 
    91 \subsection*{Regular Expression Equivalences}
   136 \subsection*{Regular Expression Equivalences}
    92 
   137 
    93 We already defined in Handout 1 what it means for two regular
   138 We already defined in Handout 1 what it means for two regular
    94 expressions to be equivalent, namely if their meaning is the
   139 expressions to be equivalent, namely if their meaning is the
   154 $\ZERO \cdot r$ & $\equiv$ & $\ZERO$\\
   199 $\ZERO \cdot r$ & $\equiv$ & $\ZERO$\\
   155 $r + r$ & $\equiv$ & $r$
   200 $r + r$ & $\equiv$ & $r$
   156 \end{tabular}
   201 \end{tabular}
   157 \end{center}
   202 \end{center}
   158 
   203 
   159 \noindent which always hold no matter what the regular
   204 \noindent which always hold no matter what the regular expression $r$
   160 expression $r$ looks like. The first two are easy to verify
   205 looks like. The first two are easy to verify since $L(\ZERO)$ is the
   161 since $L(\ZERO)$ is the empty set. The next two are also
   206 empty set. The next two are also easy to verify since $L(\ONE) =
   162 easy to verify since $L(\ONE) = \{[]\}$ and appending the
   207 \{[]\}$ and appending the empty string to every string of another set,
   163 empty string to every string of another set, leaves the set
   208 leaves the set unchanged. Be careful to fully comprehend the fifth and
   164 unchanged. Be careful to fully comprehend the fifth and sixth
   209 sixth equivalence: if you concatenate two sets of strings and one is
   165 equivalence: if you concatenate two sets of strings and one is
   210 the empty set, then the concatenation will also be the empty set. To
   166 the empty set, then the concatenation will also be the empty
   211 see this, check the definition of $\_ @ \_$ for sets. The last
   167 set. To see this, check the definition of $\_ @ \_$. The
   212 equivalence is again trivial.
   168 last equivalence is again trivial.
       
   169 
   213 
   170 What will be important later on is that we can orient these
   214 What will be important later on is that we can orient these
   171 equivalences and read them from left to right. In this way we
   215 equivalences and read them from left to right. In this way we
   172 can view them as \emph{simplification rules}. Consider for 
   216 can view them as \emph{simplification rules}. Consider for 
   173 example the regular expression 
   217 example the regular expression 
   175 \begin{equation}
   219 \begin{equation}
   176 (r_1 + \ZERO) \cdot \ONE + ((\ONE + r_2) + r_3) \cdot (r_4 \cdot \ZERO)
   220 (r_1 + \ZERO) \cdot \ONE + ((\ONE + r_2) + r_3) \cdot (r_4 \cdot \ZERO)
   177 \label{big}
   221 \label{big}
   178 \end{equation}
   222 \end{equation}
   179 
   223 
   180 \noindent If we can find an equivalent regular expression that
   224 \noindent If we can find an equivalent regular expression that is
   181 is simpler (smaller for example), then this might potentially
   225 simpler (smaller for example), then this might potentially make our
   182 make our matching algorithm run faster. The reason is that
   226 matching algorithm run faster. We can look for such a simpler regular
   183 whether a string $s$ is in $L(r)$ or in $L(r')$ with $r\equiv
   227 expression $r'$ because whether a string $s$ is in $L(r)$ or in
   184 r'$ will always give the same answer. In the example above you
   228 $L(r')$ with $r\equiv r'$ will always give the same answer. In the
   185 will see that the regular expression is equivalent to just $r_1$.
   229 example above you will see that the regular expression is equivalent
   186 You can verify this by iteratively applying the simplification
   230 to just $r_1$. You can verify this by iteratively applying the
   187 rules from above:
   231 simplification rules from above:
   188 
   232 
   189 \begin{center}
   233 \begin{center}
   190 \begin{tabular}{ll}
   234 \begin{tabular}{ll}
   191  & $(r_1 + \ZERO) \cdot \ONE + ((\ONE + r_2) + r_3) \cdot 
   235  & $(r_1 + \ZERO) \cdot \ONE + ((\ONE + r_2) + r_3) \cdot 
   192 (\underline{r_4 \cdot \ZERO})$\smallskip\\
   236 (\underline{r_4 \cdot \ZERO})$\smallskip\\
   206 algorithm quite a bit faster.
   250 algorithm quite a bit faster.
   207 
   251 
   208 \subsection*{The Matching Algorithm}
   252 \subsection*{The Matching Algorithm}
   209 
   253 
   210 The algorithm we will define below consists of two parts. One
   254 The algorithm we will define below consists of two parts. One
   211 is the function $nullable$ which takes a regular expression as
   255 is the function $\textit{nullable}$ which takes a regular expression as
   212 argument and decides whether it can match the empty string
   256 argument and decides whether it can match the empty string
   213 (this means it returns a boolean in Scala). This can be easily
   257 (this means it returns a boolean in Scala). This can be easily
   214 defined recursively as follows:
   258 defined recursively as follows:
   215 
   259 
   216 \begin{center}
   260 \begin{center}
   217 \begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}}
   261 \begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}}
   218 $nullable(\ZERO)$      & $\dn$ & $\textit{false}$\\
   262 $\textit{nullable}(\ZERO)$      & $\dn$ & $\textit{false}$\\
   219 $nullable(\ONE)$         & $\dn$ & $true$\\
   263 $\textit{nullable}(\ONE)$         & $\dn$ & $\textit{true}$\\
   220 $nullable(c)$                & $\dn$ & $\textit{false}$\\
   264 $\textit{nullable}(c)$                & $\dn$ & $\textit{false}$\\
   221 $nullable(r_1 + r_2)$     & $\dn$ &  $nullable(r_1) \vee nullable(r_2)$\\ 
   265 $\textit{nullable}(r_1 + r_2)$     & $\dn$ &  $\textit{nullable}(r_1) \vee \textit{nullable}(r_2)$\\ 
   222 $nullable(r_1 \cdot r_2)$ & $\dn$ &  $nullable(r_1) \wedge nullable(r_2)$\\
   266 $\textit{nullable}(r_1 \cdot r_2)$ & $\dn$ &  $\textit{nullable}(r_1) \wedge \textit{nullable}(r_2)$\\
   223 $nullable(r^*)$              & $\dn$ & $true$ \\
   267 $\textit{nullable}(r^*)$              & $\dn$ & $\textit{true}$ \\
   224 \end{tabular}
   268 \end{tabular}
   225 \end{center}
   269 \end{center}
   226 
   270 
   227 \noindent The idea behind this function is that the following
   271 \noindent The idea behind this function is that the following
   228 property holds:
   272 property holds:
   229 
   273 
   230 \[
   274 \[
   231 nullable(r) \;\;\text{if and only if}\;\; []\in L(r)
   275 \textit{nullable}(r) \;\;\text{if and only if}\;\; []\in L(r)
   232 \]
   276 \]
   233 
   277 
   234 \noindent Note on the left-hand side of the if-and-only-if we
   278 \noindent Note on the left-hand side of the if-and-only-if we
   235 have a function we can implement; on the right we have its
   279 have a function we can implement; on the right we have its
   236 specification (which we cannot implement in a programming
   280 specification (which we cannot implement in a programming
   237 language).
   281 language).
   238 
   282 
   239 The other function of our matching algorithm calculates a
   283 The other function of our matching algorithm calculates a
   240 \emph{derivative} of a regular expression. This is a function
   284 \emph{derivative} of a regular expression. This is a function
   241 which will take a regular expression, say $r$, and a
   285 which will take a regular expression, say $r$, and a
   242 character, say $c$, as argument and returns a new regular
   286 character, say $c$, as arguments and returns a new regular
   243 expression. Be careful that the intuition behind this function
   287 expression. Be careful that the intuition behind this function
   244 is not so easy to grasp on first reading. Essentially this
   288 is not so easy to grasp on first reading. Essentially this
   245 function solves the following problem: if $r$ can match a
   289 function solves the following problem: if $r$ can match a
   246 string of the form $c\!::\!s$, what does the regular
   290 string of the form $c\!::\!s$, what does the regular
   247 expression look like that can match just $s$? The definition
   291 expression look like that can match just $s$? The definition
   251 \begin{tabular}{l@ {\hspace{2mm}}c@ {\hspace{2mm}}l}
   295 \begin{tabular}{l@ {\hspace{2mm}}c@ {\hspace{2mm}}l}
   252   $der\, c\, (\ZERO)$      & $\dn$ & $\ZERO$\\
   296   $der\, c\, (\ZERO)$      & $\dn$ & $\ZERO$\\
   253   $der\, c\, (\ONE)$         & $\dn$ & $\ZERO$ \\
   297   $der\, c\, (\ONE)$         & $\dn$ & $\ZERO$ \\
   254   $der\, c\, (d)$                & $\dn$ & if $c = d$ then $\ONE$ else $\ZERO$\\
   298   $der\, c\, (d)$                & $\dn$ & if $c = d$ then $\ONE$ else $\ZERO$\\
   255   $der\, c\, (r_1 + r_2)$        & $\dn$ & $der\, c\, r_1 + der\, c\, r_2$\\
   299   $der\, c\, (r_1 + r_2)$        & $\dn$ & $der\, c\, r_1 + der\, c\, r_2$\\
   256   $der\, c\, (r_1 \cdot r_2)$  & $\dn$  & if $nullable (r_1)$\\
   300   $der\, c\, (r_1 \cdot r_2)$  & $\dn$  & if $\textit{nullable} (r_1)$\\
   257   & & then $(der\,c\,r_1) \cdot r_2 + der\, c\, r_2$\\ 
   301   & & then $(der\,c\,r_1) \cdot r_2 + der\, c\, r_2$\\ 
   258   & & else $(der\, c\, r_1) \cdot r_2$\\
   302   & & else $(der\, c\, r_1) \cdot r_2$\\
   259   $der\, c\, (r^*)$          & $\dn$ & $(der\,c\,r) \cdot (r^*)$
   303   $der\, c\, (r^*)$          & $\dn$ & $(der\,c\,r) \cdot (r^*)$
   260   \end{tabular}
   304   \end{tabular}
   261 \end{center}
   305 \end{center}
   276 involved. Fortunately, the $+$-case is still relatively
   320 involved. Fortunately, the $+$-case is still relatively
   277 straightforward: all strings of the form $c\!::\!s$ are either
   321 straightforward: all strings of the form $c\!::\!s$ are either
   278 matched by the regular expression $r_1$ or $r_2$. So we just
   322 matched by the regular expression $r_1$ or $r_2$. So we just
   279 have to recursively call $der$ with these two regular
   323 have to recursively call $der$ with these two regular
   280 expressions and compose the results again with $+$. Makes
   324 expressions and compose the results again with $+$. Makes
   281 sense? The $\cdot$-case is more complicated: if $r_1\cdot r_2$
   325 sense? 
       
   326 
       
   327 The $\cdot$-case is more complicated: if $r_1\cdot r_2$
   282 matches a string of the form $c\!::\!s$, then the first part
   328 matches a string of the form $c\!::\!s$, then the first part
   283 must be matched by $r_1$. Consequently, it makes sense to
   329 must be matched by $r_1$. Consequently, it makes sense to
   284 construct the regular expression for $s$ by calling $der$ with
   330 construct the regular expression for $s$ by calling $der$ with
   285 $r_1$ and ``appending'' $r_2$. There is however one exception
   331 $r_1$ and ``appending'' $r_2$. There is however one exception
   286 to this simple rule: if $r_1$ can match the empty string, then
   332 to this simple rule: if $r_1$ can match the empty string, then
   292 is again simple: if $r^*$ matches a string of the form
   338 is again simple: if $r^*$ matches a string of the form
   293 $c\!::\!s$, then the first part must be ``matched'' by a
   339 $c\!::\!s$, then the first part must be ``matched'' by a
   294 single copy of $r$. Therefore we call recursively $der\,c\,r$
   340 single copy of $r$. Therefore we call recursively $der\,c\,r$
   295 and ``append'' $r^*$ in order to match the rest of $s$.
   341 and ``append'' $r^*$ in order to match the rest of $s$.
   296 
   342 
   297 If this did not make sense, here is another way to rationalise
   343 If this did not make sense yet, here is another way to rationalise
   298 the definition of $der$ by considering the following operation
   344 the definition of $der$ by considering the following operation
   299 on sets:
   345 on sets:
   300 
   346 
   301 \begin{equation}\label{Der}
   347 \begin{equation}\label{Der}
   302 Der\,c\,A\;\dn\;\{s\,|\,c\!::\!s \in A\}
   348 Der\,c\,A\;\dn\;\{s\,|\,c\!::\!s \in A\}
   336 \begin{tabular}{rll}
   382 \begin{tabular}{rll}
   337 Input: $r_1$, $abc$\medskip\\
   383 Input: $r_1$, $abc$\medskip\\
   338 Step 1: & build derivative of $a$ and $r_1$ & $(r_2 = der\,a\,r_1)$\smallskip\\
   384 Step 1: & build derivative of $a$ and $r_1$ & $(r_2 = der\,a\,r_1)$\smallskip\\
   339 Step 2: & build derivative of $b$ and $r_2$ & $(r_3 = der\,b\,r_2)$\smallskip\\
   385 Step 2: & build derivative of $b$ and $r_2$ & $(r_3 = der\,b\,r_2)$\smallskip\\
   340 Step 3: & build derivative of $c$ and $r_3$ & $(r_4 = der\,b\,r_3)$\smallskip\\
   386 Step 3: & build derivative of $c$ and $r_3$ & $(r_4 = der\,b\,r_3)$\smallskip\\
   341 Step 4: & the string is exhausted; test & ($nullable(r_4)$)\\
   387 Step 4: & the string is exhausted; test & ($\textit{nullable}(r_4)$)\\
   342         & whether $r_4$ can recognise the\\
   388         & whether $r_4$ can recognise the\\
   343         & empty string\smallskip\\
   389         & empty string\smallskip\\
   344 Output: & result of this test $\Rightarrow true \,\text{or}\, \textit{false}$\\        
   390 Output: & result of this test $\Rightarrow \textit{true} \,\text{or}\, \textit{false}$\\        
   345 \end{tabular}
   391 \end{tabular}
   346 \end{center}
   392 \end{center}
   347 
   393 
   348 \noindent Again the operation $Der$ might help to rationalise
   394 \noindent Again the operation $Der$ might help to rationalise
   349 this algorithm. We want to know whether $abc \in L(r_1)$. We
   395 this algorithm. We want to know whether $abc \in L(r_1)$. We
   350 do not know yet---but let us assume it is. Then $Der\,a\,L(r_1)$
   396 do not know yet---but let us assume it is. Then $Der\,a\,L(r_1)$
   351 builds the set where all the strings not starting with $a$ are
   397 builds the set where all the strings not starting with $a$ are
   352 filtered out. Of the remaining strings, the $a$ is stripped
   398 filtered out. Of the remaining strings, the $a$ is stripped
   353 off. Then we continue with filtering out all strings not
   399 off. So we should still have $bc$ in the set.
       
   400 Then we continue with filtering out all strings not
   354 starting with $b$ and stripping off the $b$ from the remaining
   401 starting with $b$ and stripping off the $b$ from the remaining
   355 strings, that means we build $Der\,b\,(Der\,a\,(L(r_1)))$.
   402 strings, that means we build $Der\,b\,(Der\,a\,(L(r_1)))$.
   356 Finally we filter out all strings not starting with $c$ and
   403 Finally we filter out all strings not starting with $c$ and
   357 strip off $c$ from the remaining string. This is
   404 strip off $c$ from the remaining string. This is
   358 $Der\,c\,(Der\,b\,(Der\,a\,(L(r))))$. Now if $abc$ was in the 
   405 $Der\,c\,(Der\,b\,(Der\,a\,(L(r))))$. Now if $abc$ was in the 
   359 original set ($L(r_1)$), then in $Der\,c\,(Der\,b\,(Der\,a\,(L(r))))$ 
   406 original set ($L(r_1)$), then in $Der\,c\,(Der\,b\,(Der\,a\,(L(r))))$ 
   360 must be the empty string. If not, then $abc$ was not in the 
   407 must contain the empty string. If not, then $abc$ was not in the 
   361 language we started with.
   408 language we started with.
   362 
   409 
   363 Our matching algorithm using $der$ and $nullable$ works
   410 Our matching algorithm using $der$ and $\textit{nullable}$ works
   364 similarly, just using regular expression instead of sets. For
   411 similarly, just using regular expression instead of sets. For
   365 this we need to extend the notion of derivatives from single
   412 this we need to extend the notion of derivatives from single
   366 characters to strings. This can be done using the following
   413 characters to strings. This can be done using the following
   367 function, taking a string and regular expression as input and
   414 function, taking a string and regular expression as input and
   368 a regular expression as output.
   415 a regular expression as output.
   378 the time from the original string until it is exhausted.
   425 the time from the original string until it is exhausted.
   379 Having $ders$ in place, we can finally define our matching
   426 Having $ders$ in place, we can finally define our matching
   380 algorithm:
   427 algorithm:
   381 
   428 
   382 \[
   429 \[
   383 matches\,s\,r \dn nullable(ders\,s\,r)
   430 matches\,s\,r \dn \textit{nullable}(ders\,s\,r)
   384 \]
   431 \]
   385 
   432 
   386 \noindent
   433 \noindent
   387 and we can claim that
   434 and we can claim that
   388 
   435 
   409 in the first lecture and handout, the functions and subfunctions
   456 in the first lecture and handout, the functions and subfunctions
   410 for \pcode{matches} are shown in Figure~\ref{scala1}.
   457 for \pcode{matches} are shown in Figure~\ref{scala1}.
   411 
   458 
   412 \begin{figure}[p]
   459 \begin{figure}[p]
   413 \lstinputlisting{../progs/app5.scala}
   460 \lstinputlisting{../progs/app5.scala}
   414 \caption{Scala implementation of the nullable and 
   461 \caption{Scala implementation of the \textit{nullable} and 
   415   derivative functions. These functions are easy to
   462   derivative functions. These functions are easy to
   416   implement in functional languages, because pattern 
   463   implement in functional languages, because their built-in pattern 
   417   matching and recursion allow us to mimic the mathematical
   464   matching and recursion allow us to mimic the mathematical
   418   definitions very closely.\label{scala1}}
   465   definitions very closely.\label{scala1}}
   419 \end{figure}
   466 \end{figure}
   420 
   467 
   421 For running the algorithm with our favourite example, the evil
   468 For running the algorithm with our favourite example, the evil
   483 \end{center}
   530 \end{center}
   484 
   531 
   485 \noindent With this fix we have a constant ``size'' regular
   532 \noindent With this fix we have a constant ``size'' regular
   486 expression for our running example no matter how large $n$ is.
   533 expression for our running example no matter how large $n$ is.
   487 This means we have to also add cases for \pcode{NTIMES} in the
   534 This means we have to also add cases for \pcode{NTIMES} in the
   488 functions $nullable$ and $der$. Does the change have any
   535 functions $\textit{nullable}$ and $der$. Does the change have any
   489 effect?
   536 effect?
   490 
   537 
   491 \begin{center}
   538 \begin{center}
   492 \begin{tikzpicture}
   539 \begin{tikzpicture}
   493 \begin{axis}[
   540 \begin{axis}[
   521 can within 30 seconds handle regular expressions up to
   568 can within 30 seconds handle regular expressions up to
   522 $n = 950$ before a StackOverflow is raised. Python and Ruby 
   569 $n = 950$ before a StackOverflow is raised. Python and Ruby 
   523 (and our first version) could only handle $n = 27$ or so in 30 
   570 (and our first version) could only handle $n = 27$ or so in 30 
   524 second.
   571 second.
   525 
   572 
       
   573 SECOND EXAMPLE
       
   574 
   526 The moral is that our algorithm is rather sensitive to the
   575 The moral is that our algorithm is rather sensitive to the
   527 size of regular expressions it needs to handle. This is of
   576 size of regular expressions it needs to handle. This is of
   528 course obvious because both $nullable$ and $der$ frequently
   577 course obvious because both $\textit{nullable}$ and $der$ frequently
   529 need to traverse the whole regular expression. There seems,
   578 need to traverse the whole regular expression. There seems,
   530 however, one more issue for making the algorithm run faster.
   579 however, one more issue for making the algorithm run faster.
   531 The derivative function often produces ``useless''
   580 The derivative function often produces ``useless''
   532 $\ZERO$s and $\ONE$s. To see this, consider $r = ((a
   581 $\ZERO$s and $\ONE$s. To see this, consider $r = ((a
   533 \cdot b) + b)^*$ and the following two derivatives
   582 \cdot b) + b)^*$ and the following two derivatives
   593 \addplot[black,mark=square*,mark options={fill=white}] table {re3.data};
   642 \addplot[black,mark=square*,mark options={fill=white}] table {re3.data};
   594 \end{axis}
   643 \end{axis}
   595 \end{tikzpicture}
   644 \end{tikzpicture}
   596 \end{center}
   645 \end{center}
   597 
   646 
       
   647 SECOND EXAMPLE
       
   648 
   598 \section*{Proofs}
   649 \section*{Proofs}
   599 
   650 
   600 You might not like doing proofs. But they serve a very
   651 You might not like doing proofs. But they serve a very
   601 important purpose in Computer Science: How can we be sure that
   652 important purpose in Computer Science: How can we be sure that
   602 our algorithm matches its specification. We can try to test
   653 our algorithm matches its specification. We can try to test
   638 \noindent 
   689 \noindent 
   639 A simple proof is for example showing the following 
   690 A simple proof is for example showing the following 
   640 property:
   691 property:
   641 
   692 
   642 \begin{equation}
   693 \begin{equation}
   643 nullable(r) \;\;\text{if and only if}\;\; []\in L(r)
   694 \textit{nullable}(r) \;\;\text{if and only if}\;\; []\in L(r)
   644 \label{nullableprop}
   695 \label{nullableprop}
   645 \end{equation}
   696 \end{equation}
   646 
   697 
   647 \noindent
   698 \noindent
   648 Let us say that this property is $P(r)$, then the first case
   699 Let us say that this property is $P(r)$, then the first case
   649 we need to check is whether $P(\ZERO)$ (see recipe 
   700 we need to check is whether $P(\ZERO)$ (see recipe 
   650 above). So we have to show that
   701 above). So we have to show that
   651 
   702 
   652 \[
   703 \[
   653 nullable(\ZERO) \;\;\text{if and only if}\;\; 
   704 \textit{nullable}(\ZERO) \;\;\text{if and only if}\;\; 
   654 []\in L(\ZERO)
   705 []\in L(\ZERO)
   655 \]
   706 \]
   656 
   707 
   657 \noindent whereby $nullable(\ZERO)$ is by definition of
   708 \noindent whereby $\textit{nullable}(\ZERO)$ is by definition of
   658 the function $nullable$ always $\textit{false}$. We also have
   709 the function $\textit{nullable}$ always $\textit{false}$. We also have
   659 that $L(\ZERO)$ is by definition $\{\}$. It is
   710 that $L(\ZERO)$ is by definition $\{\}$. It is
   660 impossible that the empty string $[]$ is in the empty set.
   711 impossible that the empty string $[]$ is in the empty set.
   661 Therefore also the right-hand side is false. Consequently we
   712 Therefore also the right-hand side is false. Consequently we
   662 verified this case: both sides are false. We would still need
   713 verified this case: both sides are false. We would still need
   663 to do this for $P(\ONE)$ and $P(c)$. I leave this to
   714 to do this for $P(\ONE)$ and $P(c)$. I leave this to
   665 
   716 
   666 Next we need to check the inductive cases, for example
   717 Next we need to check the inductive cases, for example
   667 $P(r_1 + r_2)$, which is
   718 $P(r_1 + r_2)$, which is
   668 
   719 
   669 \begin{equation}
   720 \begin{equation}
   670 nullable(r_1 + r_2) \;\;\text{if and only if}\;\; 
   721 \textit{nullable}(r_1 + r_2) \;\;\text{if and only if}\;\; 
   671 []\in L(r_1 + r_2)
   722 []\in L(r_1 + r_2)
   672 \label{propalt}
   723 \label{propalt}
   673 \end{equation}
   724 \end{equation}
   674 
   725 
   675 \noindent The difference to the base cases is that in this
   726 \noindent The difference to the base cases is that in this
   676 case we can already assume we proved
   727 case we can already assume we proved
   677 
   728 
   678 \begin{center}
   729 \begin{center}
   679 \begin{tabular}{l}
   730 \begin{tabular}{l}
   680 $nullable(r_1) \;\;\text{if and only if}\;\; []\in L(r_1)$ and\\
   731 $\textit{nullable}(r_1) \;\;\text{if and only if}\;\; []\in L(r_1)$ and\\
   681 $nullable(r_2) \;\;\text{if and only if}\;\; []\in L(r_2)$\\
   732 $\textit{nullable}(r_2) \;\;\text{if and only if}\;\; []\in L(r_2)$\\
   682 \end{tabular}
   733 \end{tabular}
   683 \end{center}
   734 \end{center}
   684 
   735 
   685 \noindent These are the induction hypotheses. To check this 
   736 \noindent These are the induction hypotheses. To check this 
   686 case, we can start from $nullable(r_1 + r_2)$, which by 
   737 case, we can start from $\textit{nullable}(r_1 + r_2)$, which by 
   687 definition is
   738 definition is
   688 
   739 
   689 \[
   740 \[
   690 nullable(r_1) \vee nullable(r_2)
   741 \textit{nullable}(r_1) \vee \textit{nullable}(r_2)
   691 \]
   742 \]
   692 
   743 
   693 \noindent Using the two induction hypotheses from above,
   744 \noindent Using the two induction hypotheses from above,
   694 we can transform this into 
   745 we can transform this into 
   695 
   746 
   696 \[
   747 \[
   697 [] \in L(r_1) \vee []\in(r_2)
   748 [] \in L(r_1) \vee []\in(r_2)
   698 \]
   749 \]
   699 
   750 
   700 \noindent We just replaced the $nullable(\ldots)$ parts by
   751 \noindent We just replaced the $\textit{nullable}(\ldots)$ parts by
   701 the equivalent $[] \in L(\ldots)$ from the induction 
   752 the equivalent $[] \in L(\ldots)$ from the induction 
   702 hypotheses. A bit of thinking convinces you that if
   753 hypotheses. A bit of thinking convinces you that if
   703 $[] \in L(r_1) \vee []\in L(r_2)$ then the empty string
   754 $[] \in L(r_1) \vee []\in L(r_2)$ then the empty string
   704 must be in the union $L(r_1)\cup L(r_2)$, that is
   755 must be in the union $L(r_1)\cup L(r_2)$, that is
   705 
   756 
   708 \]
   759 \]
   709 
   760 
   710 \noindent but this is by definition of $L$ exactly $[] \in
   761 \noindent but this is by definition of $L$ exactly $[] \in
   711 L(r_1 + r_2)$, which we needed to establish according to
   762 L(r_1 + r_2)$, which we needed to establish according to
   712 \eqref{propalt}. What we have shown is that starting from
   763 \eqref{propalt}. What we have shown is that starting from
   713 $nullable(r_1 + r_2)$ we have done equivalent transformations
   764 $\textit{nullable}(r_1 + r_2)$ we have done equivalent transformations
   714 to end up with $[] \in L(r_1 + r_2)$. Consequently we have
   765 to end up with $[] \in L(r_1 + r_2)$. Consequently we have
   715 established that $P(r_1 + r_2)$ holds.
   766 established that $P(r_1 + r_2)$ holds.
   716 
   767 
   717 In order to complete the proof we would now need to look 
   768 In order to complete the proof we would now need to look 
   718 at the cases \mbox{$P(r_1\cdot r_2)$} and $P(r^*)$. Again I let you
   769 at the cases \mbox{$P(r_1\cdot r_2)$} and $P(r^*)$. Again I let you
   782 [] \in L(ders\,s\,r)
   833 [] \in L(ders\,s\,r)
   783 \label{prefinalstep}
   834 \label{prefinalstep}
   784 \end{equation}
   835 \end{equation}
   785 
   836 
   786 \noindent We have also shown that testing whether the empty
   837 \noindent We have also shown that testing whether the empty
   787 string is in a language is equivalent to the $nullable$
   838 string is in a language is equivalent to the $\textit{nullable}$
   788 function; see \eqref{nullableprop}. That means
   839 function; see \eqref{nullableprop}. That means
   789 \eqref{prefinalstep} is equivalent with
   840 \eqref{prefinalstep} is equivalent with
   790  
   841  
   791 \[
   842 \[
   792 nullable(ders\,s\,r)
   843 \textit{nullable}(ders\,s\,r)
   793 \] 
   844 \] 
   794 
   845 
   795 \noindent But this is just the definition of $matches$
   846 \noindent But this is just the definition of $matches$
   796 
   847 
   797 \[
   848 \[