handouts/ho02.tex
changeset 343 539b2e88f5b9
parent 340 c49122dbcdd1
child 394 2f9fe225ecc8
equal deleted inserted replaced
342:c235e0aeb8df 343:539b2e88f5b9
    12 This lecture is about implementing a more efficient regular
    12 This lecture is about implementing a more efficient regular
    13 expression matcher (the plots on the right)---more efficient
    13 expression matcher (the plots on the right)---more efficient
    14 than the matchers from regular expression libraries in Ruby
    14 than the matchers from regular expression libraries in Ruby
    15 and Python (the plots on the left). These plots show the
    15 and Python (the plots on the left). These plots show the
    16 running time for the evil regular expression
    16 running time for the evil regular expression
    17 $a?^{\{n\}}a^{\{n\}}$ and string composed of $n$ \pcode{a}s.
    17 $a^?^{\{n\}}\cdot a^{\{n\}}$ and strings composed of $n$ \pcode{a}s.
    18 We will use this regular expression and strings as running
    18 We will use this regular expression and strings as running
    19 example. To see the substantial differences in the two plots
    19 example. To see the substantial differences in the two plots
    20 below, note the different scales of the $x$-axes. 
    20 below, note the different scales of the $x$-axes. 
    21 
    21 
    22 
    22 
   156 easy to verify since $L(\epsilon) = \{[]\}$ and appending the
   156 easy to verify since $L(\epsilon) = \{[]\}$ and appending the
   157 empty string to every string of another set, leaves the set
   157 empty string to every string of another set, leaves the set
   158 unchanged. Be careful to fully comprehend the fifth and sixth
   158 unchanged. Be careful to fully comprehend the fifth and sixth
   159 equivalence: if you concatenate two sets of strings and one is
   159 equivalence: if you concatenate two sets of strings and one is
   160 the empty set, then the concatenation will also be the empty
   160 the empty set, then the concatenation will also be the empty
   161 set. To see this, check the definition of \pcode{_ @ _}. The
   161 set. To see this, check the definition of $\_ @ \_$. The
   162 last equivalence is again trivial.
   162 last equivalence is again trivial.
   163 
   163 
   164 What will be important later on is that we can orient these
   164 What will be important later on is that we can orient these
   165 equivalences and read them from left to right. In this way we
   165 equivalences and read them from left to right. In this way we
   166 can view them as \emph{simplification rules}. Consider for 
   166 can view them as \emph{simplification rules}. Consider for 
   174 \noindent If we can find an equivalent regular expression that
   174 \noindent If we can find an equivalent regular expression that
   175 is simpler (smaller for example), then this might potentially
   175 is simpler (smaller for example), then this might potentially
   176 make our matching algorithm run faster. The reason is that
   176 make our matching algorithm run faster. The reason is that
   177 whether a string $s$ is in $L(r)$ or in $L(r')$ with $r\equiv
   177 whether a string $s$ is in $L(r)$ or in $L(r')$ with $r\equiv
   178 r'$ will always give the same answer. In the example above you
   178 r'$ will always give the same answer. In the example above you
   179 will see that the regular expression is equivalent to $r_1$.
   179 will see that the regular expression is equivalent to just $r_1$.
   180 You can verify this by iteratively applying the simplification
   180 You can verify this by iteratively applying the simplification
   181 rules from above:
   181 rules from above:
   182 
   182 
   183 \begin{center}
   183 \begin{center}
   184 \begin{tabular}{ll}
   184 \begin{tabular}{ll}
   298 
   298 
   299 \noindent This operation essentially transforms a set of
   299 \noindent This operation essentially transforms a set of
   300 strings $A$ by filtering out all strings that do not start
   300 strings $A$ by filtering out all strings that do not start
   301 with $c$ and then strips off the $c$ from all the remaining
   301 with $c$ and then strips off the $c$ from all the remaining
   302 strings. For example suppose $A = \{f\!oo, bar, f\!rak\}$ then
   302 strings. For example suppose $A = \{f\!oo, bar, f\!rak\}$ then
   303 \[ Der\,f\,A = \{oo, rak\}\quad,\quad Der\,b\,A = \{ar\} \quad
   303 
   304 \text{and} \quad Der\,a\,A = \varnothing \]
   304 \[ Der\,f\,A = \{oo, rak\}\quad,\quad 
       
   305    Der\,b\,A = \{ar\} \quad \text{and} \quad 
       
   306    Der\,a\,A = \{\} 
       
   307 \]
   305  
   308  
   306 \noindent
   309 \noindent
   307 Note that in the last case $Der$ is empty, because no string in $A$
   310 Note that in the last case $Der$ is empty, because no string in $A$
   308 starts with $a$. With this operation we can state the following
   311 starts with $a$. With this operation we can state the following
   309 property about $der$:
   312 property about $der$:
   408   matching and recursion allow us to mimic the mathematical
   411   matching and recursion allow us to mimic the mathematical
   409   definitions very closely.\label{scala1}}
   412   definitions very closely.\label{scala1}}
   410 \end{figure}
   413 \end{figure}
   411 
   414 
   412 For running the algorithm with our favourite example, the evil
   415 For running the algorithm with our favourite example, the evil
   413 regular expression $a?^{\{n\}}a^{\{n\}}$, we need to implement
   416 regular expression $a^?^{\{n\}}a^{\{n\}}$, we need to implement
   414 the optional regular expression and the exactly $n$-times
   417 the optional regular expression and the exactly $n$-times
   415 regular expression. This can be done with the translations
   418 regular expression. This can be done with the translations
   416 
   419 
   417 \lstinputlisting[numbers=none]{../progs/app51.scala}
   420 \lstinputlisting[numbers=none]{../progs/app51.scala}
   418 
   421 
   462 \end{center}
   465 \end{center}
   463 
   466 
   464 \noindent Our algorithm traverses such regular expressions at
   467 \noindent Our algorithm traverses such regular expressions at
   465 least once every time a derivative is calculated. So having
   468 least once every time a derivative is calculated. So having
   466 large regular expressions will cause problems. This problem
   469 large regular expressions will cause problems. This problem
   467 is aggravated by $a?$ being represented as $a + \epsilon$.
   470 is aggravated by $a^?$ being represented as $a + \epsilon$.
   468 
   471 
   469 We can however fix this by having an explicit constructor for
   472 We can however fix this by having an explicit constructor for
   470 $r^{\{n\}}$. In Scala we would introduce a constructor like
   473 $r^{\{n\}}$. In Scala we would introduce a constructor like
   471 
   474 
   472 \begin{center}
   475 \begin{center}
   508 \end{tikzpicture}
   511 \end{tikzpicture}
   509 \end{center}
   512 \end{center}
   510 
   513 
   511 \noindent Now we are talking business! The modified matcher 
   514 \noindent Now we are talking business! The modified matcher 
   512 can within 30 seconds handle regular expressions up to
   515 can within 30 seconds handle regular expressions up to
   513 $n = 950$ before a StackOverflow is raised.
   516 $n = 950$ before a StackOverflow is raised. Python and Ruby 
       
   517 (and our first version) could only handle $n = 27$ or so in 30 
       
   518 second.
   514 
   519 
   515 The moral is that our algorithm is rather sensitive to the
   520 The moral is that our algorithm is rather sensitive to the
   516 size of regular expressions it needs to handle. This is of
   521 size of regular expressions it needs to handle. This is of
   517 course obvious because both $nullable$ and $der$ frequently
   522 course obvious because both $nullable$ and $der$ frequently
   518 need to traverse the whole regular expression. There seems,
   523 need to traverse the whole regular expression. There seems,
   559 \begin{figure}[p]
   564 \begin{figure}[p]
   560 \lstinputlisting{../progs/app6.scala}
   565 \lstinputlisting{../progs/app6.scala}
   561 \caption{The simplification function and modified 
   566 \caption{The simplification function and modified 
   562 \texttt{ders}-function; this function now
   567 \texttt{ders}-function; this function now
   563 calls \texttt{der} first, but then simplifies
   568 calls \texttt{der} first, but then simplifies
   564 the resulting derivative regular expressions, see
   569 the resulting derivative regular expressions before
       
   570 building the next derivative, see
   565 Line~28.\label{scala2}}
   571 Line~28.\label{scala2}}
   566 \end{figure}
   572 \end{figure}
   567 
   573 
   568 \begin{center}
   574 \begin{center}
   569 \begin{tikzpicture}
   575 \begin{tikzpicture}
   573     xmax=12000,
   579     xmax=12000,
   574     ytick={0,5,...,30},
   580     ytick={0,5,...,30},
   575     scaled ticks=false,
   581     scaled ticks=false,
   576     axis lines=left,
   582     axis lines=left,
   577     width=9cm,
   583     width=9cm,
   578     height=4cm, 
   584     height=5cm, 
   579     legend entries={Scala V2,Scala V3}]
   585     legend entries={Scala V2,Scala V3}]
   580 \addplot[green,mark=square*,mark options={fill=white}] table {re2b.data};
   586 \addplot[green,mark=square*,mark options={fill=white}] table {re2b.data};
   581 \addplot[black,mark=square*,mark options={fill=white}] table {re3.data};
   587 \addplot[black,mark=square*,mark options={fill=white}] table {re3.data};
   582 \end{axis}
   588 \end{axis}
   583 \end{tikzpicture}
   589 \end{tikzpicture}
   584 \end{center}
   590 \end{center}
   585 
   591 
   586 \section*{Proofs}
   592 \section*{Proofs}
   587 
   593 
   588 You might not like doing proofs. But they serve a very
   594 You might not like doing proofs. But they serve a very
   589 important purpose in Computer Science: When can we be sure
   595 important purpose in Computer Science: How can we be sure that
   590 that our algorithms match their specification. We can try to
   596 our algorithm matches its specification. We can try to test
   591 test the algorithms, but that often overlooks corner cases and
   597 the algorithm, but that often overlooks corner cases and an
   592 also often an exhaustive testing is impossible (since there
   598 exhaustive testing is impossible (since there are infinitely
   593 are often infinitely many inputs). Proofs allow us to ensure
   599 many inputs). Proofs allow us to ensure that an algorithm
   594 that an algorithm meets its specification. 
   600 really meets its specification. 
   595 
   601 
   596 For the programs we look at in this module, the proofs will
   602 For the programs we look at in this module, the proofs will
   597 mostly by some form of induction. Remember that regular
   603 mostly by some form of induction. Remember that regular
   598 expressions are defined as 
   604 expressions are defined as 
   599 
   605 
   625 
   631 
   626 \noindent 
   632 \noindent 
   627 A simple proof is for example showing the following 
   633 A simple proof is for example showing the following 
   628 property:
   634 property:
   629 
   635 
   630 \[
   636 \begin{equation}
   631 nullable(r) \;\;\text{if and only if}\;\; []\in L(r)
   637 nullable(r) \;\;\text{if and only if}\;\; []\in L(r)
   632 \]
   638 \label{nullableprop}
       
   639 \end{equation}
   633 
   640 
   634 \noindent
   641 \noindent
   635 Let us say that this property is $P(r)$, then the first case
   642 Let us say that this property is $P(r)$, then the first case
   636 we need to check is whether $P(\varnothing)$ (see recipe 
   643 we need to check is whether $P(\varnothing)$ (see recipe 
   637 above). So we have to show that
   644 above). So we have to show that
   640 nullable(\varnothing) \;\;\text{if and only if}\;\; 
   647 nullable(\varnothing) \;\;\text{if and only if}\;\; 
   641 []\in L(\varnothing)
   648 []\in L(\varnothing)
   642 \]
   649 \]
   643 
   650 
   644 \noindent whereby $nullable(\varnothing)$ is by definition of
   651 \noindent whereby $nullable(\varnothing)$ is by definition of
   645 the function $nullable$ always $\textit{false}$. We also
   652 the function $nullable$ always $\textit{false}$. We also have
   646 have that $L(\varnothing)$ is by definition $\{\}$. It is 
   653 that $L(\varnothing)$ is by definition $\{\}$. It is
   647 impossible that the empty string $[]$ is in the empty set. 
   654 impossible that the empty string $[]$ is in the empty set.
   648 Therefore also the right-hand side is false. Consequently we
   655 Therefore also the right-hand side is false. Consequently we
   649 verified this case. We would still need to do this for 
   656 verified this case: both sides are false. We would still need
   650 $P(\varepsilon)$ and $P(c)$. I leave this to you to check.
   657 to do this for $P(\varepsilon)$ and $P(c)$. I leave this to
       
   658 you to verify.
   651 
   659 
   652 Next we need to check the inductive cases, for example
   660 Next we need to check the inductive cases, for example
   653 $P(r_1 + r_2)$, which is
   661 $P(r_1 + r_2)$, which is
   654 
   662 
   655 \[
   663 \begin{equation}
   656 nullable(r_1 + r_2) \;\;\text{if and only if}\;\; 
   664 nullable(r_1 + r_2) \;\;\text{if and only if}\;\; 
   657 []\in L(r_1 + r_2)
   665 []\in L(r_1 + r_2)
   658 \]
   666 \label{propalt}
       
   667 \end{equation}
   659 
   668 
   660 \noindent The difference to the base cases is that in this
   669 \noindent The difference to the base cases is that in this
   661 case we can already assume we proved
   670 case we can already assume we proved
   662 
   671 
   663 \begin{center}
   672 \begin{center}
   664 \begin{tabular}{l}
   673 \begin{tabular}{l}
   665 $nullable(r_1) \;\;\text{if and only if}\;\; []\in L(r_1)$\\
   674 $nullable(r_1) \;\;\text{if and only if}\;\; []\in L(r_1)$ and\\
   666 $nullable(r_2) \;\;\text{if and only if}\;\; []\in L(r_2)$\\
   675 $nullable(r_2) \;\;\text{if and only if}\;\; []\in L(r_2)$\\
   667 \end{tabular}
   676 \end{tabular}
   668 \end{center}
   677 \end{center}
   669 
   678 
   670 \noindent These are the induction hypotheses. To check this 
   679 \noindent These are the induction hypotheses. To check this 
   671 case we can start from $nullable(r_1 + r_2)$, which by 
   680 case, we can start from $nullable(r_1 + r_2)$, which by 
   672 definition is
   681 definition is
   673 
   682 
   674 \[
   683 \[
   675 nullable(r_1) \vee nullable(r_2)
   684 nullable(r_1) \vee nullable(r_2)
   676 \]
   685 \]
   677 
   686 
   678 \noindent Using the induction hypotheses we can transform
   687 \noindent Using the two induction hypotheses from above,
   679 this into 
   688 we can transform this into 
   680 
   689 
   681 \[
   690 \[
   682 [] \in L(r_1) \vee []\in(r_2)
   691 [] \in L(r_1) \vee []\in(r_2)
   683 \]
   692 \]
   684 
   693 
   685 \noindent We just replaced the $nullable(\ldots)$ parts by
   694 \noindent We just replaced the $nullable(\ldots)$ parts by
   686 the equivalent $[] \in L(\ldots)$ from the induction 
   695 the equivalent $[] \in L(\ldots)$ from the induction 
   687 hypotheses. A bit of thinking convinces you that if
   696 hypotheses. A bit of thinking convinces you that if
   688 $[] \in L(r_1) \vee []\in(r_2)$ then the empty string
   697 $[] \in L(r_1) \vee []\in L(r_2)$ then the empty string
   689 must be in the union $L(r_1)\cup L(r_2)$, that is
   698 must be in the union $L(r_1)\cup L(r_2)$, that is
   690 
   699 
   691 \[
   700 \[
   692 [] \in L(r_1)\cup L(r_2)
   701 [] \in L(r_1)\cup L(r_2)
   693 \]
   702 \]
   694 
   703 
   695 \noindent but this is by definition of $L$ exactly
   704 \noindent but this is by definition of $L$ exactly $[] \in
   696 $[] \in L(r_1 + r_2)$, which we needed to establish.
   705 L(r_1 + r_2)$, which we needed to establish according to
   697 What we have shown is that starting from 
   706 \eqref{propalt}. What we have shown is that starting from
   698 $nullable(r_1 + r_2)$ we have done equivalent transformations
   707 $nullable(r_1 + r_2)$ we have done equivalent transformations
   699 to end up with $[] \in L(r_1 + r_2)$. Consequently we have 
   708 to end up with $[] \in L(r_1 + r_2)$. Consequently we have
   700 established that $P(r_1 + r_2)$ holds.
   709 established that $P(r_1 + r_2)$ holds.
   701 
   710 
   702 In order to complete the proof we would now need to look 
   711 In order to complete the proof we would now need to look 
   703 at the cases $P(r_1\cdot r_2)$ and $P(r^*)$. Again I let you
   712 at the cases \mbox{$P(r_1\cdot r_2)$} and $P(r^*)$. Again I let you
   704 check the details.
   713 check the details.
   705 
   714 
   706 Finally, you might have to do induction proofs over strings. 
   715 You might have to do induction proofs over strings. 
   707 That means you want to establish a property $P(s)$ for all
   716 That means you want to establish a property $P(s)$ for all
   708 strings $s$. For this remember strings are lists of 
   717 strings $s$. For this remember strings are lists of 
   709 characters. These lists can be either the empty list or a
   718 characters. These lists can be either the empty list or a
   710 list of the form $c::s$. If you want to perform an induction
   719 list of the form $c::s$. If you want to perform an induction
   711 proof for strings you need to consider the cases
   720 proof for strings you need to consider the cases
   717 \end{itemize}
   726 \end{itemize}
   718 
   727 
   719 \noindent
   728 \noindent
   720 Given this recipe, I let you show
   729 Given this recipe, I let you show
   721 
   730 
   722 \[
   731 \begin{equation}
   723 Ders\,s\,(L(r)) = L(ders\,s\,r)
   732 Ders\,s\,(L(r)) = L(ders\,s\,r)
   724 \]
   733 \label{dersprop}
   725 
   734 \end{equation}
   726 \noindent by induction on $s$. In this proof you can
   735 
   727 assume the property for $der$ and $Der$ has already been
   736 \noindent by induction on $s$. In this proof you can assume
       
   737 the following property for $der$ and $Der$ has already been
   728 proved, that is you can assume
   738 proved, that is you can assume
   729 
   739 
   730 \[
   740 \[
   731 L(der\,c\,r) = Der\,c\,(L(r))
   741 L(der\,c\,r) = Der\,c\,(L(r))
   732 \]
   742 \]
   733 
   743 
   734 \noindent holds (this would be of course a property that
   744 \noindent holds (this would be of course a property that
   735 needs to be proved in a side-lemma by induction on $r$).
   745 needs to be proved in a side-lemma by induction on $r$).
       
   746 
       
   747 To sum up, using reasoning like the one shown above allows us 
       
   748 to show the correctness of our algorithm. To see this,
       
   749 start from the specification
       
   750 
       
   751 \[
       
   752 s \in L(r)
       
   753 \]
       
   754 
       
   755 \noindent That is the problem we want to solve. Thinking a 
       
   756 little, you will see that this problem is equivalent to the 
       
   757 following problem
       
   758 
       
   759 \begin{equation}
       
   760 [] \in Ders\,s\,(L(r))
       
   761 \label{dersstep}
       
   762 \end{equation}
       
   763 
       
   764 \noindent But we have shown above in \eqref{dersprop}, that
       
   765 the $Ders$ can be replaced by $L(ders\ldots)$. That means 
       
   766 \eqref{dersstep} is equivalent to 
       
   767 
       
   768 \begin{equation}
       
   769 [] \in L(ders\,s\,r)
       
   770 \label{prefinalstep}
       
   771 \end{equation}
       
   772 
       
   773 \noindent We have also shown that testing whether the empty
       
   774 string is in a language is equivalent to the $nullable$
       
   775 function; see \eqref{nullableprop}. That means
       
   776 \eqref{prefinalstep} is equivalent with
       
   777  
       
   778 \[
       
   779 nullable(ders\,s\,r)
       
   780 \] 
       
   781 
       
   782 \noindent But this is just the definition of $matches$
       
   783 
       
   784 \[
       
   785 matches\,s\,r \dn nullable(ders\,s\,r)
       
   786 \]
       
   787 
       
   788 \noindent In effect we have shown
       
   789 
       
   790 \[
       
   791 matches\,s\,r\;\;\text{if and only if}\;\;
       
   792 s\in L(r)
       
   793 \]
       
   794 
       
   795 \noindent which is the property we set out to prove:
       
   796 our algorithm meets its specification. To have done
       
   797 so, requires a few induction proofs about strings and
       
   798 regular expressions. Following the recipes is already a big 
       
   799 step in performing these proofs.
   736 
   800 
   737 \end{document}
   801 \end{document}
   738 
   802 
   739 
   803 
   740 
   804