291 expression look like that can match just $s$? The definition |
297 expression look like that can match just $s$? The definition |
292 of this function is as follows: |
298 of this function is as follows: |
293 |
299 |
294 \begin{center} |
300 \begin{center} |
295 \begin{tabular}{l@ {\hspace{2mm}}c@ {\hspace{2mm}}l} |
301 \begin{tabular}{l@ {\hspace{2mm}}c@ {\hspace{2mm}}l} |
296 $der\, c\, (\ZERO)$ & $\dn$ & $\ZERO$\\ |
302 $\textit{der}\, c\, (\ZERO)$ & $\dn$ & $\ZERO$\\ |
297 $der\, c\, (\ONE)$ & $\dn$ & $\ZERO$ \\ |
303 $\textit{der}\, c\, (\ONE)$ & $\dn$ & $\ZERO$ \\ |
298 $der\, c\, (d)$ & $\dn$ & if $c = d$ then $\ONE$ else $\ZERO$\\ |
304 $\textit{der}\, c\, (d)$ & $\dn$ & if $c = d$ then $\ONE$ else $\ZERO$\\ |
299 $der\, c\, (r_1 + r_2)$ & $\dn$ & $der\, c\, r_1 + der\, c\, r_2$\\ |
305 $\textit{der}\, c\, (r_1 + r_2)$ & $\dn$ & $\textit{der}\, c\, r_1 + \textit{der}\, c\, r_2$\\ |
300 $der\, c\, (r_1 \cdot r_2)$ & $\dn$ & if $\textit{nullable} (r_1)$\\ |
306 $\textit{der}\, c\, (r_1 \cdot r_2)$ & $\dn$ & if $\textit{nullable} (r_1)$\\ |
301 & & then $(der\,c\,r_1) \cdot r_2 + der\, c\, r_2$\\ |
307 & & then $(\textit{der}\,c\,r_1) \cdot r_2 + \textit{der}\, c\, r_2$\\ |
302 & & else $(der\, c\, r_1) \cdot r_2$\\ |
308 & & else $(\textit{der}\, c\, r_1) \cdot r_2$\\ |
303 $der\, c\, (r^*)$ & $\dn$ & $(der\,c\,r) \cdot (r^*)$ |
309 $\textit{der}\, c\, (r^*)$ & $\dn$ & $(\textit{der}\,c\,r) \cdot (r^*)$ |
304 \end{tabular} |
310 \end{tabular} |
305 \end{center} |
311 \end{center} |
306 |
312 |
307 \noindent The first two clauses can be rationalised as |
313 \noindent The first two clauses can be rationalised as |
308 follows: recall that $der$ should calculate a regular |
314 follows: recall that $\textit{der}$ should calculate a regular |
309 expression so that given the ``input'' regular expression can |
315 expression so that given the ``input'' regular expression can |
310 match a string of the form $c\!::\!s$, we want a regular |
316 match a string of the form $c\!::\!s$, we want a regular |
311 expression for $s$. Since neither $\ZERO$ nor $\ONE$ |
317 expression for $s$. Since neither $\ZERO$ nor $\ONE$ |
312 can match a string of the form $c\!::\!s$, we return |
318 can match a string of the form $c\!::\!s$, we return |
313 $\ZERO$. In the third case we have to make a |
319 $\ZERO$. In the third case we have to make a |
318 return $\ZERO$ since no string of the $c\!::\!s$ can be |
324 return $\ZERO$ since no string of the $c\!::\!s$ can be |
319 matched. Next come the recursive cases, which are a bit more |
325 matched. Next come the recursive cases, which are a bit more |
320 involved. Fortunately, the $+$-case is still relatively |
326 involved. Fortunately, the $+$-case is still relatively |
321 straightforward: all strings of the form $c\!::\!s$ are either |
327 straightforward: all strings of the form $c\!::\!s$ are either |
322 matched by the regular expression $r_1$ or $r_2$. So we just |
328 matched by the regular expression $r_1$ or $r_2$. So we just |
323 have to recursively call $der$ with these two regular |
329 have to recursively call $\textit{der}$ with these two regular |
324 expressions and compose the results again with $+$. Makes |
330 expressions and compose the results again with $+$. Makes |
325 sense? |
331 sense? |
326 |
332 |
327 The $\cdot$-case is more complicated: if $r_1\cdot r_2$ |
333 The $\cdot$-case is more complicated: if $r_1\cdot r_2$ |
328 matches a string of the form $c\!::\!s$, then the first part |
334 matches a string of the form $c\!::\!s$, then the first part |
329 must be matched by $r_1$. Consequently, it makes sense to |
335 must be matched by $r_1$. Consequently, it makes sense to |
330 construct the regular expression for $s$ by calling $der$ with |
336 construct the regular expression for $s$ by calling $\textit{der}$ with |
331 $r_1$ and ``appending'' $r_2$. There is however one exception |
337 $r_1$ and ``appending'' $r_2$. There is however one exception |
332 to this simple rule: if $r_1$ can match the empty string, then |
338 to this simple rule: if $r_1$ can match the empty string, then |
333 all of $c\!::\!s$ is matched by $r_2$. So in case $r_1$ is |
339 all of $c\!::\!s$ is matched by $r_2$. So in case $r_1$ is |
334 nullable (that is can match the empty string) we have to allow |
340 nullable (that is can match the empty string) we have to allow |
335 the choice $der\,c\,r_2$ for calculating the regular |
341 the choice $\textit{der}\,c\,r_2$ for calculating the regular |
336 expression that can match $s$. Therefore we have to add the |
342 expression that can match $s$. Therefore we have to add the |
337 regular expression $der\,c\,r_2$ in the result. The $*$-case |
343 regular expression $\textit{der}\,c\,r_2$ in the result. The $*$-case |
338 is again simple: if $r^*$ matches a string of the form |
344 is again simple: if $r^*$ matches a string of the form |
339 $c\!::\!s$, then the first part must be ``matched'' by a |
345 $c\!::\!s$, then the first part must be ``matched'' by a |
340 single copy of $r$. Therefore we call recursively $der\,c\,r$ |
346 single copy of $r$. Therefore we call recursively $\textit{der}\,c\,r$ |
341 and ``append'' $r^*$ in order to match the rest of $s$. |
347 and ``append'' $r^*$ in order to match the rest of $s$. Still |
342 |
348 makes sense? |
343 If this did not make sense yet, here is another way to rationalise |
349 |
344 the definition of $der$ by considering the following operation |
350 If all this did not make sense yet, here is another way to rationalise |
|
351 the definition of $\textit{der}$ by considering the following operation |
345 on sets: |
352 on sets: |
346 |
353 |
347 \begin{equation}\label{Der} |
354 \begin{equation}\label{Der} |
348 Der\,c\,A\;\dn\;\{s\,|\,c\!::\!s \in A\} |
355 \textit{Der}\,c\,A\;\dn\;\{s\,|\,c\!::\!s \in A\} |
349 \end{equation} |
356 \end{equation} |
350 |
357 |
351 \noindent This operation essentially transforms a set of |
358 \noindent This operation essentially transforms a set of |
352 strings $A$ by filtering out all strings that do not start |
359 strings $A$ by filtering out all strings that do not start |
353 with $c$ and then strips off the $c$ from all the remaining |
360 with $c$ and then strips off the $c$ from all the remaining |
354 strings. For example suppose $A = \{f\!oo, bar, f\!rak\}$ then |
361 strings. For example suppose $A = \{f\!oo, bar, f\!rak\}$ then |
355 |
362 |
356 \[ Der\,f\,A = \{oo, rak\}\quad,\quad |
363 \[ \textit{Der}\,f\,A = \{oo, rak\}\quad,\quad |
357 Der\,b\,A = \{ar\} \quad \text{and} \quad |
364 \textit{Der}\,b\,A = \{ar\} \quad \text{and} \quad |
358 Der\,a\,A = \{\} |
365 \textit{Der}\,a\,A = \{\} |
359 \] |
366 \] |
360 |
367 |
361 \noindent |
368 \noindent |
362 Note that in the last case $Der$ is empty, because no string in $A$ |
369 Note that in the last case $\textit{Der}$ is empty, because no string in $A$ |
363 starts with $a$. With this operation we can state the following |
370 starts with $a$. With this operation we can state the following |
364 property about $der$: |
371 property about $\textit{der}$: |
365 |
372 |
366 \[ |
373 \[ |
367 L(der\,c\,r) = Der\,c\,(L(r)) |
374 L(\textit{der}\,c\,r) = \textit{Der}\,c\,(L(r)) |
368 \] |
375 \] |
369 |
376 |
370 \noindent |
377 \noindent |
371 This property clarifies what regular expression $der$ calculates, |
378 This property clarifies what regular expression $\textit{der}$ calculates, |
372 namely take the set of strings that $r$ can match (that is $L(r)$), |
379 namely take the set of strings that $r$ can match (that is $L(r)$), |
373 filter out all strings not starting with $c$ and strip off the $c$ |
380 filter out all strings not starting with $c$ and strip off the $c$ |
374 from the remaining strings---this is exactly the language that |
381 from the remaining strings---this is exactly the language that |
375 $der\,c\,r$ can match. |
382 $\textit{der}\,c\,r$ can match. |
376 |
383 |
377 If we want to find out whether the string $abc$ is matched by |
384 If we want to find out whether the string $abc$ is matched by |
378 the regular expression $r_1$ then we can iteratively apply $der$ |
385 the regular expression $r_1$ then we can iteratively apply $\textit{der}$ |
379 as follows |
386 as follows |
380 |
387 |
381 \begin{center} |
388 \begin{center} |
382 \begin{tabular}{rll} |
389 \begin{tabular}{rll} |
383 Input: $r_1$, $abc$\medskip\\ |
390 Input: $r_1$, $abc$\medskip\\ |
384 Step 1: & build derivative of $a$ and $r_1$ & $(r_2 = der\,a\,r_1)$\smallskip\\ |
391 Step 1: & build derivative of $a$ and $r_1$ & $(r_2 = \textit{der}\,a\,r_1)$\smallskip\\ |
385 Step 2: & build derivative of $b$ and $r_2$ & $(r_3 = der\,b\,r_2)$\smallskip\\ |
392 Step 2: & build derivative of $b$ and $r_2$ & $(r_3 = \textit{der}\,b\,r_2)$\smallskip\\ |
386 Step 3: & build derivative of $c$ and $r_3$ & $(r_4 = der\,b\,r_3)$\smallskip\\ |
393 Step 3: & build derivative of $c$ and $r_3$ & $(r_4 = \textit{der}\,b\,r_3)$\smallskip\\ |
387 Step 4: & the string is exhausted; test & ($\textit{nullable}(r_4)$)\\ |
394 Step 4: & the string is exhausted; test & ($\textit{nullable}(r_4)$)\\ |
388 & whether $r_4$ can recognise the\\ |
395 & whether $r_4$ can recognise the\\ |
389 & empty string\smallskip\\ |
396 & empty string\smallskip\\ |
390 Output: & result of this test $\Rightarrow \textit{true} \,\text{or}\, \textit{false}$\\ |
397 Output: & result of this test $\Rightarrow \textit{true} \,\text{or}\, \textit{false}$\\ |
391 \end{tabular} |
398 \end{tabular} |
392 \end{center} |
399 \end{center} |
393 |
400 |
394 \noindent Again the operation $Der$ might help to rationalise |
401 \noindent Again the operation $\textit{Der}$ might help to rationalise |
395 this algorithm. We want to know whether $abc \in L(r_1)$. We |
402 this algorithm. We want to know whether $abc \in L(r_1)$. We |
396 do not know yet---but let us assume it is. Then $Der\,a\,L(r_1)$ |
403 do not know yet---but let us assume it is. Then $\textit{Der}\,a\,L(r_1)$ |
397 builds the set where all the strings not starting with $a$ are |
404 builds the set where all the strings not starting with $a$ are |
398 filtered out. Of the remaining strings, the $a$ is stripped |
405 filtered out. Of the remaining strings, the $a$ is stripped |
399 off. So we should still have $bc$ in the set. |
406 off. So we should still have $bc$ in the set. |
400 Then we continue with filtering out all strings not |
407 Then we continue with filtering out all strings not |
401 starting with $b$ and stripping off the $b$ from the remaining |
408 starting with $b$ and stripping off the $b$ from the remaining |
402 strings, that means we build $Der\,b\,(Der\,a\,(L(r_1)))$. |
409 strings, that means we build $\textit{Der}\,b\,(\textit{Der}\,a\,(L(r_1)))$. |
403 Finally we filter out all strings not starting with $c$ and |
410 Finally we filter out all strings not starting with $c$ and |
404 strip off $c$ from the remaining string. This is |
411 strip off $c$ from the remaining string. This is |
405 $Der\,c\,(Der\,b\,(Der\,a\,(L(r))))$. Now if $abc$ was in the |
412 $\textit{Der}\,c\,(\textit{Der}\,b\,(\textit{Der}\,a\,(L(r_1))))$. Now if $abc$ was in the |
406 original set ($L(r_1)$), then in $Der\,c\,(Der\,b\,(Der\,a\,(L(r))))$ |
413 original set ($L(r_1)$), then $\textit{Der}\,c\,(\textit{Der}\,b\,(\textit{Der}\,a\,(L(r_1))))$ |
407 must contain the empty string. If not, then $abc$ was not in the |
414 must contain the empty string. If not, then $abc$ was not in the |
408 language we started with. |
415 language we started with. |
409 |
416 |
410 Our matching algorithm using $der$ and $\textit{nullable}$ works |
417 Our matching algorithm using $\textit{der}$ and $\textit{nullable}$ works |
411 similarly, just using regular expression instead of sets. For |
418 similarly, just using regular expression instead of sets. In order to |
412 this we need to extend the notion of derivatives from single |
419 define our algorithm we need to extend the notion of derivatives from single |
413 characters to strings. This can be done using the following |
420 characters to strings. This can be done using the following |
414 function, taking a string and regular expression as input and |
421 function, taking a string and a regular expression as input and |
415 a regular expression as output. |
422 a regular expression as output. |
416 |
423 |
417 \begin{center} |
424 \begin{center} |
418 \begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {\hspace{-10mm}}l@ {}} |
425 \begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {\hspace{-10mm}}l@ {}} |
419 $\textit{ders}\, []\, r$ & $\dn$ & $r$ & \\ |
426 $\textit{ders}\, []\, r$ & $\dn$ & $r$ & \\ |
420 $\textit{ders}\, (c\!::\!s)\, r$ & $\dn$ & $\textit{ders}\,s\,(der\,c\,r)$ & \\ |
427 $\textit{ders}\, (c\!::\!s)\, r$ & $\dn$ & $\textit{ders}\,s\,(\textit{der}\,c\,r)$ & \\ |
421 \end{tabular} |
428 \end{tabular} |
422 \end{center} |
429 \end{center} |
423 |
430 |
424 \noindent This function iterates $der$ taking one character at |
431 \noindent This function iterates $\textit{der}$ taking one character at |
425 the time from the original string until it is exhausted. |
432 the time from the original string until it is exhausted. |
426 Having $ders$ in place, we can finally define our matching |
433 Having $\textit{der}s$ in place, we can finally define our matching |
427 algorithm: |
434 algorithm: |
428 |
435 |
429 \[ |
436 \[ |
430 matches\,s\,r \dn \textit{nullable}(ders\,s\,r) |
437 \textit{matches}\,s\,r \dn \textit{nullable}(\textit{ders}\,s\,r) |
431 \] |
438 \] |
432 |
439 |
433 \noindent |
440 \noindent |
434 and we can claim that |
441 and we can claim that |
435 |
442 |
436 \[ |
443 \[ |
437 matches\,s\,r\quad\text{if and only if}\quad s\in L(r) |
444 \textit{matches}\,s\,r\quad\text{if and only if}\quad s\in L(r) |
438 \] |
445 \] |
439 |
446 |
440 \noindent holds, which means our algorithm satisfies the |
447 \noindent holds, which means our algorithm satisfies the |
441 specification. Of course we can claim many things\ldots |
448 specification. Of course we can claim many things\ldots |
442 whether the claim holds any water is a different question, |
449 whether the claim holds any water is a different question, |
443 which for example is the point of the Strand-2 Coursework. |
450 which for example is the point of the Strand-2 Coursework. |
444 |
451 |
445 This algorithm was introduced by Janus Brzozowski in 1964. Its |
452 This algorithm was introduced by Janus Brzozowski in 1964, but |
|
453 is more widely known only in the last 10 or so years. Its |
446 main attractions are simplicity and being fast, as well as |
454 main attractions are simplicity and being fast, as well as |
447 being easily extendable for other regular expressions such as |
455 being easily extendable for other regular expressions such as |
448 $r^{\{n\}}$, $r^?$, $\sim{}r$ and so on (this is subject of |
456 $r^{\{n\}}$, $r^?$, $\sim{}r$ and so on (this is subject of |
449 Strand-1 Coursework 1). |
457 Strand-1 Coursework 1). |
450 |
458 |
451 \subsection*{The Matching Algorithm in Scala} |
459 \subsection*{The Matching Algorithm in Scala} |
452 |
460 |
453 Another attraction of the algorithm is that it can be easily |
461 Another attraction of the algorithm is that it can be easily |
454 implemented in a functional programming language, like Scala. |
462 implemented in a functional programming language, like Scala. |
564 \end{tikzpicture} |
612 \end{tikzpicture} |
565 \end{center} |
613 \end{center} |
566 |
614 |
567 \noindent Now we are talking business! The modified matcher |
615 \noindent Now we are talking business! The modified matcher |
568 can within 30 seconds handle regular expressions up to |
616 can within 30 seconds handle regular expressions up to |
569 $n = 950$ before a StackOverflow is raised. Python and Ruby |
617 $n = 950$ before a StackOverflow is raised. Recall that Python and Ruby |
570 (and our first version) could only handle $n = 27$ or so in 30 |
618 (and our first version, Scala V1) could only handle $n = 27$ or so in 30 |
571 second. |
619 seconds. There is no change for our second example |
572 |
620 $(a^*)^* \cdot b$---so this is still good. |
573 SECOND EXAMPLE |
621 |
574 |
622 |
575 The moral is that our algorithm is rather sensitive to the |
623 The moral is that our algorithm is rather sensitive to the |
576 size of regular expressions it needs to handle. This is of |
624 size of regular expressions it needs to handle. This is of |
577 course obvious because both $\textit{nullable}$ and $der$ frequently |
625 course obvious because both $\textit{nullable}$ and $\textit{der}$ frequently |
578 need to traverse the whole regular expression. There seems, |
626 need to traverse the whole regular expression. There seems, |
579 however, one more issue for making the algorithm run faster. |
627 however, one more issue for making the algorithm run faster. |
580 The derivative function often produces ``useless'' |
628 The derivative function often produces ``useless'' |
581 $\ZERO$s and $\ONE$s. To see this, consider $r = ((a |
629 $\ZERO$s and $\ONE$s. To see this, consider $r = ((a |
582 \cdot b) + b)^*$ and the following two derivatives |
630 \cdot b) + b)^*$ and the following two derivatives |
583 |
631 |
584 \begin{center} |
632 \begin{center} |
585 \begin{tabular}{l} |
633 \begin{tabular}{l} |
586 $der\,a\,r = ((\ONE \cdot b) + \ZERO) \cdot r$\\ |
634 $\textit{der}\,a\,r = ((\ONE \cdot b) + \ZERO) \cdot r$\\ |
587 $der\,b\,r = ((\ZERO \cdot b) + \ONE)\cdot r$\\ |
635 $\textit{der}\,b\,r = ((\ZERO \cdot b) + \ONE)\cdot r$\\ |
588 $der\,c\,r = ((\ZERO \cdot b) + \ZERO)\cdot r$ |
636 $\textit{der}\,c\,r = ((\ZERO \cdot b) + \ZERO)\cdot r$ |
589 \end{tabular} |
637 \end{tabular} |
590 \end{center} |
638 \end{center} |
591 |
639 |
592 \noindent |
640 \noindent |
593 If we simplify them according to the simple rules from the |
641 If we simplify them according to the simple rules from the |
594 beginning, we can replace the right-hand sides by the |
642 beginning, we can replace the right-hand sides by the |
595 smaller equivalent regular expressions |
643 smaller equivalent regular expressions |
596 |
644 |
597 \begin{center} |
645 \begin{center} |
598 \begin{tabular}{l} |
646 \begin{tabular}{l} |
599 $der\,a\,r \equiv b \cdot r$\\ |
647 $\textit{der}\,a\,r \equiv b \cdot r$\\ |
600 $der\,b\,r \equiv r$\\ |
648 $\textit{der}\,b\,r \equiv r$\\ |
601 $der\,c\,r \equiv \ZERO$ |
649 $\textit{der}\,c\,r \equiv \ZERO$ |
602 \end{tabular} |
650 \end{tabular} |
603 \end{center} |
651 \end{center} |
604 |
652 |
605 \noindent I leave it to you to contemplate whether such a |
653 \noindent I leave it to you to contemplate whether such a |
606 simplification can have any impact on the correctness of our |
654 simplification can have any impact on the correctness of our |