handouts/ho02.tex
changeset 571 499007a7bce2
parent 566 b153c04834eb
child 618 f4818c95a32e
equal deleted inserted replaced
570:617c3b0e0a81 571:499007a7bce2
     4 \usepackage{../graphics}
     4 \usepackage{../graphics}
     5 \usepackage{../data}
     5 \usepackage{../data}
     6 
     6 
     7 
     7 
     8 \begin{document}
     8 \begin{document}
     9 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017}
     9 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017, 2018}
    10 
    10 
    11 
    11 
    12 \section*{Handout 2 (Regular Expression Matching)}
    12 \section*{Handout 2 (Regular Expression Matching)}
    13 
    13 
    14 This lecture is about implementing a more efficient regular expression
    14 This lecture is about implementing a more efficient regular expression
   442 original set ($L(r_1)$), then $\textit{Der}\,c\,(\textit{Der}\,b\,(\textit{Der}\,a\,(L(r_1))))$ 
   442 original set ($L(r_1)$), then $\textit{Der}\,c\,(\textit{Der}\,b\,(\textit{Der}\,a\,(L(r_1))))$ 
   443 must contain the empty string. If not, then $abc$ was not in the 
   443 must contain the empty string. If not, then $abc$ was not in the 
   444 language we started with.
   444 language we started with.
   445 
   445 
   446 Our matching algorithm using $\textit{der}$ and $\textit{nullable}$ works
   446 Our matching algorithm using $\textit{der}$ and $\textit{nullable}$ works
   447 similarly, just using regular expression instead of sets. In order to
   447 similarly, just using regular expressions instead of sets. In order to
   448 define our algorithm we need to extend the notion of derivatives from single
   448 define our algorithm we need to extend the notion of derivatives from single
   449 characters to strings. This can be done using the following
   449 characters to strings. This can be done using the following
   450 function, taking a string and a regular expression as input and
   450 function, taking a string and a regular expression as input and
   451 a regular expression as output.
   451 a regular expression as output.
   452 
   452 
   461 the time from the original string until the string is exhausted.
   461 the time from the original string until the string is exhausted.
   462 Having $\textit{der}s$ in place, we can finally define our matching
   462 Having $\textit{der}s$ in place, we can finally define our matching
   463 algorithm:
   463 algorithm:
   464 
   464 
   465 \[
   465 \[
   466 \textit{matches}\,s\,r \dn \textit{nullable}(\textit{ders}\,s\,r)
   466 \textit{matches}\,r\,s \dn \textit{nullable}(\textit{ders}\,s\,r)
   467 \]
   467 \]
   468 
   468 
   469 \noindent
   469 \noindent
   470 and we can claim that
   470 and we can claim that
   471 
   471 
   472 \[
   472 \[
   473 \textit{matches}\,s\,r\quad\text{if and only if}\quad s\in L(r)
   473 \textit{matches}\,r\,s\quad\text{if and only if}\quad s\in L(r)
   474 \]
   474 \]
   475 
   475 
   476 \noindent holds, which means our algorithm satisfies the
   476 \noindent holds, which means our algorithm satisfies the
   477 specification. Of course we can claim many things\ldots
   477 specification. Of course we can claim many things\ldots
   478 whether the claim holds any water is a different question,
   478 whether the claim holds any water is a different question,
   677 \noindent I leave it to you to contemplate whether such a
   677 \noindent I leave it to you to contemplate whether such a
   678 simplification can have any impact on the correctness of our algorithm
   678 simplification can have any impact on the correctness of our algorithm
   679 (will it change any answers?). Figure~\ref{scala2} gives a
   679 (will it change any answers?). Figure~\ref{scala2} gives a
   680 simplification function that recursively traverses a regular
   680 simplification function that recursively traverses a regular
   681 expression and simplifies it according to the rules given at the
   681 expression and simplifies it according to the rules given at the
   682 beginning. There are only rules for $+$, $\cdot$ and $n$-times (the
   682 beginning. There are only rules for $+$ and $\cdot$. There is 
   683 latter because we added it in the second version of our
   683 no simplification rule for a star, because
   684 matcher). There is no simplification rule for a star, because
       
   685 empirical data and also a little thought showed that simplifying under
   684 empirical data and also a little thought showed that simplifying under
   686 a star is a waste of computation time. The simplification function
   685 a star is a waste of computation time. The simplification function
   687 will be called after every derivation.  This additional step removes
   686 will be called after every derivation.  This additional step removes
   688 all the ``junk'' the derivative function introduced. Does this improve
   687 all the ``junk'' the derivative function introduced. Does this improve
   689 the speed? You bet!!
   688 the speed? You bet!!
   731 a^{\{n\}}$.  We need a third of this time to do the same with strings
   730 a^{\{n\}}$.  We need a third of this time to do the same with strings
   732 up to 11,000 \texttt{a}s.  Similarly, Java 8 and Python needed 30
   731 up to 11,000 \texttt{a}s.  Similarly, Java 8 and Python needed 30
   733 seconds to find out the regular expression $(a^*)^* \cdot b$ does not
   732 seconds to find out the regular expression $(a^*)^* \cdot b$ does not
   734 match the string of 28 \texttt{a}s. In Java 9 and later this has been 
   733 match the string of 28 \texttt{a}s. In Java 9 and later this has been 
   735 cranked up to 39,000 \texttt{a}s, but we can do the same in the same 
   734 cranked up to 39,000 \texttt{a}s, but we can do the same in the same 
   736 amount of time for strings composed of nearly 6,000,000 \texttt{a}s:
   735 amount of time for strings composed of nearly 6,000,000 \texttt{a}s. 
       
   736 This is shown in the following plot.
   737 
   737 
   738 
   738 
   739 \begin{center}
   739 \begin{center}
   740 \begin{tikzpicture}
   740 \begin{tikzpicture}
   741 \begin{axis}[
   741 \begin{axis}[