4 \usepackage{../graphics} |
4 \usepackage{../graphics} |
5 \usepackage{../data} |
5 \usepackage{../data} |
6 |
6 |
7 |
7 |
8 \begin{document} |
8 \begin{document} |
9 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017} |
9 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017, 2018} |
10 |
10 |
11 |
11 |
12 \section*{Handout 2 (Regular Expression Matching)} |
12 \section*{Handout 2 (Regular Expression Matching)} |
13 |
13 |
14 This lecture is about implementing a more efficient regular expression |
14 This lecture is about implementing a more efficient regular expression |
442 original set ($L(r_1)$), then $\textit{Der}\,c\,(\textit{Der}\,b\,(\textit{Der}\,a\,(L(r_1))))$ |
442 original set ($L(r_1)$), then $\textit{Der}\,c\,(\textit{Der}\,b\,(\textit{Der}\,a\,(L(r_1))))$ |
443 must contain the empty string. If not, then $abc$ was not in the |
443 must contain the empty string. If not, then $abc$ was not in the |
444 language we started with. |
444 language we started with. |
445 |
445 |
446 Our matching algorithm using $\textit{der}$ and $\textit{nullable}$ works |
446 Our matching algorithm using $\textit{der}$ and $\textit{nullable}$ works |
447 similarly, just using regular expression instead of sets. In order to |
447 similarly, just using regular expressions instead of sets. In order to |
448 define our algorithm we need to extend the notion of derivatives from single |
448 define our algorithm we need to extend the notion of derivatives from single |
449 characters to strings. This can be done using the following |
449 characters to strings. This can be done using the following |
450 function, taking a string and a regular expression as input and |
450 function, taking a string and a regular expression as input and |
451 a regular expression as output. |
451 a regular expression as output. |
452 |
452 |
461 the time from the original string until the string is exhausted. |
461 the time from the original string until the string is exhausted. |
462 Having $\textit{der}s$ in place, we can finally define our matching |
462 Having $\textit{der}s$ in place, we can finally define our matching |
463 algorithm: |
463 algorithm: |
464 |
464 |
465 \[ |
465 \[ |
466 \textit{matches}\,s\,r \dn \textit{nullable}(\textit{ders}\,s\,r) |
466 \textit{matches}\,r\,s \dn \textit{nullable}(\textit{ders}\,s\,r) |
467 \] |
467 \] |
468 |
468 |
469 \noindent |
469 \noindent |
470 and we can claim that |
470 and we can claim that |
471 |
471 |
472 \[ |
472 \[ |
473 \textit{matches}\,s\,r\quad\text{if and only if}\quad s\in L(r) |
473 \textit{matches}\,r\,s\quad\text{if and only if}\quad s\in L(r) |
474 \] |
474 \] |
475 |
475 |
476 \noindent holds, which means our algorithm satisfies the |
476 \noindent holds, which means our algorithm satisfies the |
477 specification. Of course we can claim many things\ldots |
477 specification. Of course we can claim many things\ldots |
478 whether the claim holds any water is a different question, |
478 whether the claim holds any water is a different question, |
677 \noindent I leave it to you to contemplate whether such a |
677 \noindent I leave it to you to contemplate whether such a |
678 simplification can have any impact on the correctness of our algorithm |
678 simplification can have any impact on the correctness of our algorithm |
679 (will it change any answers?). Figure~\ref{scala2} gives a |
679 (will it change any answers?). Figure~\ref{scala2} gives a |
680 simplification function that recursively traverses a regular |
680 simplification function that recursively traverses a regular |
681 expression and simplifies it according to the rules given at the |
681 expression and simplifies it according to the rules given at the |
682 beginning. There are only rules for $+$, $\cdot$ and $n$-times (the |
682 beginning. There are only rules for $+$ and $\cdot$. There is |
683 latter because we added it in the second version of our |
683 no simplification rule for a star, because |
684 matcher). There is no simplification rule for a star, because |
|
685 empirical data and also a little thought showed that simplifying under |
684 empirical data and also a little thought showed that simplifying under |
686 a star is a waste of computation time. The simplification function |
685 a star is a waste of computation time. The simplification function |
687 will be called after every derivation. This additional step removes |
686 will be called after every derivation. This additional step removes |
688 all the ``junk'' the derivative function introduced. Does this improve |
687 all the ``junk'' the derivative function introduced. Does this improve |
689 the speed? You bet!! |
688 the speed? You bet!! |
731 a^{\{n\}}$. We need a third of this time to do the same with strings |
730 a^{\{n\}}$. We need a third of this time to do the same with strings |
732 up to 11,000 \texttt{a}s. Similarly, Java 8 and Python needed 30 |
731 up to 11,000 \texttt{a}s. Similarly, Java 8 and Python needed 30 |
733 seconds to find out the regular expression $(a^*)^* \cdot b$ does not |
732 seconds to find out the regular expression $(a^*)^* \cdot b$ does not |
734 match the string of 28 \texttt{a}s. In Java 9 and later this has been |
733 match the string of 28 \texttt{a}s. In Java 9 and later this has been |
735 cranked up to 39,000 \texttt{a}s, but we can do the same in the same |
734 cranked up to 39,000 \texttt{a}s, but we can do the same in the same |
736 amount of time for strings composed of nearly 6,000,000 \texttt{a}s: |
735 amount of time for strings composed of nearly 6,000,000 \texttt{a}s. |
|
736 This is shown in the following plot. |
737 |
737 |
738 |
738 |
739 \begin{center} |
739 \begin{center} |
740 \begin{tikzpicture} |
740 \begin{tikzpicture} |
741 \begin{axis}[ |
741 \begin{axis}[ |