12 This lecture is about implementing a more efficient regular |
12 This lecture is about implementing a more efficient regular |
13 expression matcher (the plots on the right)---more efficient |
13 expression matcher (the plots on the right)---more efficient |
14 than the matchers from regular expression libraries in Ruby |
14 than the matchers from regular expression libraries in Ruby |
15 and Python (the plots on the left). These plots show the |
15 and Python (the plots on the left). These plots show the |
16 running time for the evil regular expression |
16 running time for the evil regular expression |
17 $a?^{\{n\}}a^{\{n\}}$ and string composed of $n$ \pcode{a}s. |
17 $a^?^{\{n\}}\cdot a^{\{n\}}$ and strings composed of $n$ \pcode{a}s. |
18 We will use this regular expression and strings as running |
18 We will use this regular expression and strings as running |
19 example. To see the substantial differences in the two plots |
19 example. To see the substantial differences in the two plots |
20 below, note the different scales of the $x$-axes. |
20 below, note the different scales of the $x$-axes. |
21 |
21 |
22 |
22 |
156 easy to verify since $L(\epsilon) = \{[]\}$ and appending the |
156 easy to verify since $L(\epsilon) = \{[]\}$ and appending the |
157 empty string to every string of another set, leaves the set |
157 empty string to every string of another set, leaves the set |
158 unchanged. Be careful to fully comprehend the fifth and sixth |
158 unchanged. Be careful to fully comprehend the fifth and sixth |
159 equivalence: if you concatenate two sets of strings and one is |
159 equivalence: if you concatenate two sets of strings and one is |
160 the empty set, then the concatenation will also be the empty |
160 the empty set, then the concatenation will also be the empty |
161 set. To see this, check the definition of \pcode{_ @ _}. The |
161 set. To see this, check the definition of $\_ @ \_$. The |
162 last equivalence is again trivial. |
162 last equivalence is again trivial. |
163 |
163 |
164 What will be important later on is that we can orient these |
164 What will be important later on is that we can orient these |
165 equivalences and read them from left to right. In this way we |
165 equivalences and read them from left to right. In this way we |
166 can view them as \emph{simplification rules}. Consider for |
166 can view them as \emph{simplification rules}. Consider for |
298 |
298 |
299 \noindent This operation essentially transforms a set of |
299 \noindent This operation essentially transforms a set of |
300 strings $A$ by filtering out all strings that do not start |
300 strings $A$ by filtering out all strings that do not start |
301 with $c$ and then strips off the $c$ from all the remaining |
301 with $c$ and then strips off the $c$ from all the remaining |
302 strings. For example suppose $A = \{f\!oo, bar, f\!rak\}$ then |
302 strings. For example suppose $A = \{f\!oo, bar, f\!rak\}$ then |
303 \[ Der\,f\,A = \{oo, rak\}\quad,\quad Der\,b\,A = \{ar\} \quad |
303 |
304 \text{and} \quad Der\,a\,A = \varnothing \] |
304 \[ Der\,f\,A = \{oo, rak\}\quad,\quad |
|
305 Der\,b\,A = \{ar\} \quad \text{and} \quad |
|
306 Der\,a\,A = \{\} |
|
307 \] |
305 |
308 |
306 \noindent |
309 \noindent |
307 Note that in the last case $Der$ is empty, because no string in $A$ |
310 Note that in the last case $Der$ is empty, because no string in $A$ |
308 starts with $a$. With this operation we can state the following |
311 starts with $a$. With this operation we can state the following |
309 property about $der$: |
312 property about $der$: |
573 xmax=12000, |
579 xmax=12000, |
574 ytick={0,5,...,30}, |
580 ytick={0,5,...,30}, |
575 scaled ticks=false, |
581 scaled ticks=false, |
576 axis lines=left, |
582 axis lines=left, |
577 width=9cm, |
583 width=9cm, |
578 height=4cm, |
584 height=5cm, |
579 legend entries={Scala V2,Scala V3}] |
585 legend entries={Scala V2,Scala V3}] |
580 \addplot[green,mark=square*,mark options={fill=white}] table {re2b.data}; |
586 \addplot[green,mark=square*,mark options={fill=white}] table {re2b.data}; |
581 \addplot[black,mark=square*,mark options={fill=white}] table {re3.data}; |
587 \addplot[black,mark=square*,mark options={fill=white}] table {re3.data}; |
582 \end{axis} |
588 \end{axis} |
583 \end{tikzpicture} |
589 \end{tikzpicture} |
584 \end{center} |
590 \end{center} |
585 |
591 |
586 \section*{Proofs} |
592 \section*{Proofs} |
587 |
593 |
588 You might not like doing proofs. But they serve a very |
594 You might not like doing proofs. But they serve a very |
589 important purpose in Computer Science: When can we be sure |
595 important purpose in Computer Science: How can we be sure that |
590 that our algorithms match their specification. We can try to |
596 our algorithm matches its specification. We can try to test |
591 test the algorithms, but that often overlooks corner cases and |
597 the algorithm, but that often overlooks corner cases and an |
592 also often an exhaustive testing is impossible (since there |
598 exhaustive testing is impossible (since there are infinitely |
593 are often infinitely many inputs). Proofs allow us to ensure |
599 many inputs). Proofs allow us to ensure that an algorithm |
594 that an algorithm meets its specification. |
600 really meets its specification. |
595 |
601 |
596 For the programs we look at in this module, the proofs will |
602 For the programs we look at in this module, the proofs will |
597 mostly by some form of induction. Remember that regular |
603 mostly by some form of induction. Remember that regular |
598 expressions are defined as |
604 expressions are defined as |
599 |
605 |
640 nullable(\varnothing) \;\;\text{if and only if}\;\; |
647 nullable(\varnothing) \;\;\text{if and only if}\;\; |
641 []\in L(\varnothing) |
648 []\in L(\varnothing) |
642 \] |
649 \] |
643 |
650 |
644 \noindent whereby $nullable(\varnothing)$ is by definition of |
651 \noindent whereby $nullable(\varnothing)$ is by definition of |
645 the function $nullable$ always $\textit{false}$. We also |
652 the function $nullable$ always $\textit{false}$. We also have |
646 have that $L(\varnothing)$ is by definition $\{\}$. It is |
653 that $L(\varnothing)$ is by definition $\{\}$. It is |
647 impossible that the empty string $[]$ is in the empty set. |
654 impossible that the empty string $[]$ is in the empty set. |
648 Therefore also the right-hand side is false. Consequently we |
655 Therefore also the right-hand side is false. Consequently we |
649 verified this case. We would still need to do this for |
656 verified this case: both sides are false. We would still need |
650 $P(\varepsilon)$ and $P(c)$. I leave this to you to check. |
657 to do this for $P(\varepsilon)$ and $P(c)$. I leave this to |
|
658 you to verify. |
651 |
659 |
652 Next we need to check the inductive cases, for example |
660 Next we need to check the inductive cases, for example |
653 $P(r_1 + r_2)$, which is |
661 $P(r_1 + r_2)$, which is |
654 |
662 |
655 \[ |
663 \begin{equation} |
656 nullable(r_1 + r_2) \;\;\text{if and only if}\;\; |
664 nullable(r_1 + r_2) \;\;\text{if and only if}\;\; |
657 []\in L(r_1 + r_2) |
665 []\in L(r_1 + r_2) |
658 \] |
666 \label{propalt} |
|
667 \end{equation} |
659 |
668 |
660 \noindent The difference to the base cases is that in this |
669 \noindent The difference to the base cases is that in this |
661 case we can already assume we proved |
670 case we can already assume we proved |
662 |
671 |
663 \begin{center} |
672 \begin{center} |
664 \begin{tabular}{l} |
673 \begin{tabular}{l} |
665 $nullable(r_1) \;\;\text{if and only if}\;\; []\in L(r_1)$\\ |
674 $nullable(r_1) \;\;\text{if and only if}\;\; []\in L(r_1)$ and\\ |
666 $nullable(r_2) \;\;\text{if and only if}\;\; []\in L(r_2)$\\ |
675 $nullable(r_2) \;\;\text{if and only if}\;\; []\in L(r_2)$\\ |
667 \end{tabular} |
676 \end{tabular} |
668 \end{center} |
677 \end{center} |
669 |
678 |
670 \noindent These are the induction hypotheses. To check this |
679 \noindent These are the induction hypotheses. To check this |
671 case we can start from $nullable(r_1 + r_2)$, which by |
680 case, we can start from $nullable(r_1 + r_2)$, which by |
672 definition is |
681 definition is |
673 |
682 |
674 \[ |
683 \[ |
675 nullable(r_1) \vee nullable(r_2) |
684 nullable(r_1) \vee nullable(r_2) |
676 \] |
685 \] |
677 |
686 |
678 \noindent Using the induction hypotheses we can transform |
687 \noindent Using the two induction hypotheses from above, |
679 this into |
688 we can transform this into |
680 |
689 |
681 \[ |
690 \[ |
682 [] \in L(r_1) \vee []\in(r_2) |
691 [] \in L(r_1) \vee []\in(r_2) |
683 \] |
692 \] |
684 |
693 |
685 \noindent We just replaced the $nullable(\ldots)$ parts by |
694 \noindent We just replaced the $nullable(\ldots)$ parts by |
686 the equivalent $[] \in L(\ldots)$ from the induction |
695 the equivalent $[] \in L(\ldots)$ from the induction |
687 hypotheses. A bit of thinking convinces you that if |
696 hypotheses. A bit of thinking convinces you that if |
688 $[] \in L(r_1) \vee []\in(r_2)$ then the empty string |
697 $[] \in L(r_1) \vee []\in L(r_2)$ then the empty string |
689 must be in the union $L(r_1)\cup L(r_2)$, that is |
698 must be in the union $L(r_1)\cup L(r_2)$, that is |
690 |
699 |
691 \[ |
700 \[ |
692 [] \in L(r_1)\cup L(r_2) |
701 [] \in L(r_1)\cup L(r_2) |
693 \] |
702 \] |
694 |
703 |
695 \noindent but this is by definition of $L$ exactly |
704 \noindent but this is by definition of $L$ exactly $[] \in |
696 $[] \in L(r_1 + r_2)$, which we needed to establish. |
705 L(r_1 + r_2)$, which we needed to establish according to |
697 What we have shown is that starting from |
706 \eqref{propalt}. What we have shown is that starting from |
698 $nullable(r_1 + r_2)$ we have done equivalent transformations |
707 $nullable(r_1 + r_2)$ we have done equivalent transformations |
699 to end up with $[] \in L(r_1 + r_2)$. Consequently we have |
708 to end up with $[] \in L(r_1 + r_2)$. Consequently we have |
700 established that $P(r_1 + r_2)$ holds. |
709 established that $P(r_1 + r_2)$ holds. |
701 |
710 |
702 In order to complete the proof we would now need to look |
711 In order to complete the proof we would now need to look |
703 at the cases $P(r_1\cdot r_2)$ and $P(r^*)$. Again I let you |
712 at the cases \mbox{$P(r_1\cdot r_2)$} and $P(r^*)$. Again I let you |
704 check the details. |
713 check the details. |
705 |
714 |
706 Finally, you might have to do induction proofs over strings. |
715 You might have to do induction proofs over strings. |
707 That means you want to establish a property $P(s)$ for all |
716 That means you want to establish a property $P(s)$ for all |
708 strings $s$. For this remember strings are lists of |
717 strings $s$. For this remember strings are lists of |
709 characters. These lists can be either the empty list or a |
718 characters. These lists can be either the empty list or a |
710 list of the form $c::s$. If you want to perform an induction |
719 list of the form $c::s$. If you want to perform an induction |
711 proof for strings you need to consider the cases |
720 proof for strings you need to consider the cases |
717 \end{itemize} |
726 \end{itemize} |
718 |
727 |
719 \noindent |
728 \noindent |
720 Given this recipe, I let you show |
729 Given this recipe, I let you show |
721 |
730 |
722 \[ |
731 \begin{equation} |
723 Ders\,s\,(L(r)) = L(ders\,s\,r) |
732 Ders\,s\,(L(r)) = L(ders\,s\,r) |
724 \] |
733 \label{dersprop} |
725 |
734 \end{equation} |
726 \noindent by induction on $s$. In this proof you can |
735 |
727 assume the property for $der$ and $Der$ has already been |
736 \noindent by induction on $s$. In this proof you can assume |
|
737 the following property for $der$ and $Der$ has already been |
728 proved, that is you can assume |
738 proved, that is you can assume |
729 |
739 |
730 \[ |
740 \[ |
731 L(der\,c\,r) = Der\,c\,(L(r)) |
741 L(der\,c\,r) = Der\,c\,(L(r)) |
732 \] |
742 \] |
733 |
743 |
734 \noindent holds (this would be of course a property that |
744 \noindent holds (this would be of course a property that |
735 needs to be proved in a side-lemma by induction on $r$). |
745 needs to be proved in a side-lemma by induction on $r$). |
|
746 |
|
747 To sum up, using reasoning like the one shown above allows us |
|
748 to show the correctness of our algorithm. To see this, |
|
749 start from the specification |
|
750 |
|
751 \[ |
|
752 s \in L(r) |
|
753 \] |
|
754 |
|
755 \noindent That is the problem we want to solve. Thinking a |
|
756 little, you will see that this problem is equivalent to the |
|
757 following problem |
|
758 |
|
759 \begin{equation} |
|
760 [] \in Ders\,s\,(L(r)) |
|
761 \label{dersstep} |
|
762 \end{equation} |
|
763 |
|
764 \noindent But we have shown above in \eqref{dersprop}, that |
|
765 the $Ders$ can be replaced by $L(ders\ldots)$. That means |
|
766 \eqref{dersstep} is equivalent to |
|
767 |
|
768 \begin{equation} |
|
769 [] \in L(ders\,s\,r) |
|
770 \label{prefinalstep} |
|
771 \end{equation} |
|
772 |
|
773 \noindent We have also shown that testing whether the empty |
|
774 string is in a language is equivalent to the $nullable$ |
|
775 function; see \eqref{nullableprop}. That means |
|
776 \eqref{prefinalstep} is equivalent with |
|
777 |
|
778 \[ |
|
779 nullable(ders\,s\,r) |
|
780 \] |
|
781 |
|
782 \noindent But this is just the definition of $matches$ |
|
783 |
|
784 \[ |
|
785 matches\,s\,r \dn nullable(ders\,s\,r) |
|
786 \] |
|
787 |
|
788 \noindent In effect we have shown |
|
789 |
|
790 \[ |
|
791 matches\,s\,r\;\;\text{if and only if}\;\; |
|
792 s\in L(r) |
|
793 \] |
|
794 |
|
795 \noindent which is the property we set out to prove: |
|
796 our algorithm meets its specification. To have done |
|
797 so, requires a few induction proofs about strings and |
|
798 regular expressions. Following the recipes is already a big |
|
799 step in performing these proofs. |
736 |
800 |
737 \end{document} |
801 \end{document} |
738 |
802 |
739 |
803 |
740 |
804 |