|      4 \usepackage{../graphics} |      4 \usepackage{../graphics} | 
|      5 \usepackage{../data} |      5 \usepackage{../data} | 
|      6  |      6  | 
|      7  |      7  | 
|      8 \begin{document} |      8 \begin{document} | 
|      9 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017} |      9 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017, 2018} | 
|     10  |     10  | 
|     11  |     11  | 
|     12 \section*{Handout 2 (Regular Expression Matching)} |     12 \section*{Handout 2 (Regular Expression Matching)} | 
|     13  |     13  | 
|     14 This lecture is about implementing a more efficient regular expression |     14 This lecture is about implementing a more efficient regular expression | 
|    442 original set ($L(r_1)$), then $\textit{Der}\,c\,(\textit{Der}\,b\,(\textit{Der}\,a\,(L(r_1))))$  |    442 original set ($L(r_1)$), then $\textit{Der}\,c\,(\textit{Der}\,b\,(\textit{Der}\,a\,(L(r_1))))$  | 
|    443 must contain the empty string. If not, then $abc$ was not in the  |    443 must contain the empty string. If not, then $abc$ was not in the  | 
|    444 language we started with. |    444 language we started with. | 
|    445  |    445  | 
|    446 Our matching algorithm using $\textit{der}$ and $\textit{nullable}$ works |    446 Our matching algorithm using $\textit{der}$ and $\textit{nullable}$ works | 
|    447 similarly, just using regular expression instead of sets. In order to |    447 similarly, just using regular expressions instead of sets. In order to | 
|    448 define our algorithm we need to extend the notion of derivatives from single |    448 define our algorithm we need to extend the notion of derivatives from single | 
|    449 characters to strings. This can be done using the following |    449 characters to strings. This can be done using the following | 
|    450 function, taking a string and a regular expression as input and |    450 function, taking a string and a regular expression as input and | 
|    451 a regular expression as output. |    451 a regular expression as output. | 
|    452  |    452  | 
|    461 the time from the original string until the string is exhausted. |    461 the time from the original string until the string is exhausted. | 
|    462 Having $\textit{der}s$ in place, we can finally define our matching |    462 Having $\textit{der}s$ in place, we can finally define our matching | 
|    463 algorithm: |    463 algorithm: | 
|    464  |    464  | 
|    465 \[ |    465 \[ | 
|    466 \textit{matches}\,s\,r \dn \textit{nullable}(\textit{ders}\,s\,r) |    466 \textit{matches}\,r\,s \dn \textit{nullable}(\textit{ders}\,s\,r) | 
|    467 \] |    467 \] | 
|    468  |    468  | 
|    469 \noindent |    469 \noindent | 
|    470 and we can claim that |    470 and we can claim that | 
|    471  |    471  | 
|    472 \[ |    472 \[ | 
|    473 \textit{matches}\,s\,r\quad\text{if and only if}\quad s\in L(r) |    473 \textit{matches}\,r\,s\quad\text{if and only if}\quad s\in L(r) | 
|    474 \] |    474 \] | 
|    475  |    475  | 
|    476 \noindent holds, which means our algorithm satisfies the |    476 \noindent holds, which means our algorithm satisfies the | 
|    477 specification. Of course we can claim many things\ldots |    477 specification. Of course we can claim many things\ldots | 
|    478 whether the claim holds any water is a different question, |    478 whether the claim holds any water is a different question, | 
|    677 \noindent I leave it to you to contemplate whether such a |    677 \noindent I leave it to you to contemplate whether such a | 
|    678 simplification can have any impact on the correctness of our algorithm |    678 simplification can have any impact on the correctness of our algorithm | 
|    679 (will it change any answers?). Figure~\ref{scala2} gives a |    679 (will it change any answers?). Figure~\ref{scala2} gives a | 
|    680 simplification function that recursively traverses a regular |    680 simplification function that recursively traverses a regular | 
|    681 expression and simplifies it according to the rules given at the |    681 expression and simplifies it according to the rules given at the | 
|    682 beginning. There are only rules for $+$, $\cdot$ and $n$-times (the |    682 beginning. There are only rules for $+$ and $\cdot$. There is  | 
|    683 latter because we added it in the second version of our |    683 no simplification rule for a star, because | 
|    684 matcher). There is no simplification rule for a star, because |         | 
|    685 empirical data and also a little thought showed that simplifying under |    684 empirical data and also a little thought showed that simplifying under | 
|    686 a star is a waste of computation time. The simplification function |    685 a star is a waste of computation time. The simplification function | 
|    687 will be called after every derivation.  This additional step removes |    686 will be called after every derivation.  This additional step removes | 
|    688 all the ``junk'' the derivative function introduced. Does this improve |    687 all the ``junk'' the derivative function introduced. Does this improve | 
|    689 the speed? You bet!! |    688 the speed? You bet!! | 
|    731 a^{\{n\}}$.  We need a third of this time to do the same with strings |    730 a^{\{n\}}$.  We need a third of this time to do the same with strings | 
|    732 up to 11,000 \texttt{a}s.  Similarly, Java 8 and Python needed 30 |    731 up to 11,000 \texttt{a}s.  Similarly, Java 8 and Python needed 30 | 
|    733 seconds to find out the regular expression $(a^*)^* \cdot b$ does not |    732 seconds to find out the regular expression $(a^*)^* \cdot b$ does not | 
|    734 match the string of 28 \texttt{a}s. In Java 9 and later this has been  |    733 match the string of 28 \texttt{a}s. In Java 9 and later this has been  | 
|    735 cranked up to 39,000 \texttt{a}s, but we can do the same in the same  |    734 cranked up to 39,000 \texttt{a}s, but we can do the same in the same  | 
|    736 amount of time for strings composed of nearly 6,000,000 \texttt{a}s: |    735 amount of time for strings composed of nearly 6,000,000 \texttt{a}s.  | 
|         |    736 This is shown in the following plot. | 
|    737  |    737  | 
|    738  |    738  | 
|    739 \begin{center} |    739 \begin{center} | 
|    740 \begin{tikzpicture} |    740 \begin{tikzpicture} | 
|    741 \begin{axis}[ |    741 \begin{axis}[ |