11 |
11 |
12 |
12 |
13 \section*{Handout 2 (Regular Expression Matching)} |
13 \section*{Handout 2 (Regular Expression Matching)} |
14 |
14 |
15 This lecture is about implementing a more efficient regular expression |
15 This lecture is about implementing a more efficient regular expression |
16 matcher (the plots on the right)---more efficient than the matchers |
16 matcher (the plots on the right below)---more efficient than the |
17 from regular expression libraries in Ruby, Python and Java (the plots |
17 matchers from regular expression libraries in Ruby, Python and Java |
18 on the left). The first pair of plots show the running time for the |
18 (the plots on the left). The first pair of plots show the running time |
19 regular expressions $a^?{}^{\{n\}}\cdot a^{\{n\}}$ and strings composed |
19 for the regular expression $(a^*)^*\cdot b$ and strings composed of |
20 of $n$ \pcode{a}s. The second pair of plots show the running time |
20 $n$ \pcode{a}s (meaning this regular expression actually does not |
21 for the regular expression $(a^*)^*\cdot b$ and also strings composed |
21 match the strings). The second pair of plots show the running time for |
22 of $n$ \pcode{a}s (meaning this regular expression actually does not |
22 the regular expressions $a^?{}^{\{n\}}\cdot a^{\{n\}}$ and strings |
|
23 also composed of $n$ \pcode{a}s (this time the regular expressions |
23 match the strings). To see the substantial differences in the left |
24 match the strings). To see the substantial differences in the left |
24 and right plots below, note the different scales of the $x$-axes. |
25 and right plots below, note the different scales of the $x$-axes. |
25 |
26 |
|
27 |
|
28 \begin{center} |
|
29 Graphs: $(a^*)^* \cdot b$ and strings $\underbrace{a\ldots a}_{n}$ |
|
30 \begin{tabular}{@{}cc@{}} |
|
31 \begin{tikzpicture} |
|
32 \begin{axis}[ |
|
33 xlabel={$n$}, |
|
34 x label style={at={(1.05,0.0)}}, |
|
35 ylabel={time in secs}, |
|
36 enlargelimits=false, |
|
37 xtick={0,5,...,30}, |
|
38 xmax=33, |
|
39 ymax=35, |
|
40 ytick={0,5,...,30}, |
|
41 scaled ticks=false, |
|
42 axis lines=left, |
|
43 width=5cm, |
|
44 height=5cm, |
|
45 legend entries={Java, Python}, |
|
46 legend pos=north west, |
|
47 legend cell align=left] |
|
48 \addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data}; |
|
49 \addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data}; |
|
50 \end{axis} |
|
51 \end{tikzpicture} |
|
52 & |
|
53 \begin{tikzpicture} |
|
54 \begin{axis}[ |
|
55 xlabel={$n$}, |
|
56 x label style={at={(1.05,0.0)}}, |
|
57 ylabel={time in secs}, |
|
58 enlargelimits=false, |
|
59 ymax=35, |
|
60 ytick={0,5,...,30}, |
|
61 axis lines=left, |
|
62 scaled ticks=false, |
|
63 width=6.5cm, |
|
64 height=5cm, |
|
65 legend entries={Scala V3}, |
|
66 legend pos=north east, |
|
67 legend cell align=left] |
|
68 %\addplot[green,mark=square*,mark options={fill=white}] table {re2a.data}; |
|
69 \addplot[black,mark=square*,mark options={fill=white}] table {re3a.data}; |
|
70 \end{axis} |
|
71 \end{tikzpicture} |
|
72 \end{tabular} |
|
73 \end{center} |
26 |
74 |
27 \begin{center} |
75 \begin{center} |
28 Graphs: $a^{?\{n\}} \cdot a^{\{n\}}$ and strings $\underbrace{a\ldots a}_{n}$\\ |
76 Graphs: $a^{?\{n\}} \cdot a^{\{n\}}$ and strings $\underbrace{a\ldots a}_{n}$\\ |
29 \begin{tabular}{@{}cc@{}} |
77 \begin{tabular}{@{}cc@{}} |
30 \begin{tikzpicture} |
78 \begin{tikzpicture} |
60 ymax=35, |
108 ymax=35, |
61 ytick={0,5,...,30}, |
109 ytick={0,5,...,30}, |
62 scaled ticks=false, |
110 scaled ticks=false, |
63 axis lines=left, |
111 axis lines=left, |
64 width=6.5cm, |
112 width=6.5cm, |
65 height=5cm] |
113 height=5cm, |
66 \addplot[green,mark=square*,mark options={fill=white}] table {re2.data}; |
114 legend entries={Scala V3}, |
|
115 legend pos=north east, |
|
116 legend cell align=left] |
|
117 %\addplot[green,mark=square*,mark options={fill=white}] table {re2.data}; |
67 \addplot[black,mark=square*,mark options={fill=white}] table {re3.data}; |
118 \addplot[black,mark=square*,mark options={fill=white}] table {re3.data}; |
68 \end{axis} |
119 \end{axis} |
69 \end{tikzpicture} |
120 \end{tikzpicture} |
70 \end{tabular} |
121 \end{tabular} |
71 \end{center} |
122 \end{center} |
72 |
123 \medskip |
73 \begin{center} |
|
74 Graphs: $(a^*)^* \cdot b$ and strings $\underbrace{a\ldots a}_{n}$ |
|
75 \begin{tabular}{@{}cc@{}} |
|
76 \begin{tikzpicture} |
|
77 \begin{axis}[ |
|
78 xlabel={$n$}, |
|
79 x label style={at={(1.05,0.0)}}, |
|
80 ylabel={time in secs}, |
|
81 enlargelimits=false, |
|
82 xtick={0,5,...,30}, |
|
83 xmax=33, |
|
84 ymax=35, |
|
85 ytick={0,5,...,30}, |
|
86 scaled ticks=false, |
|
87 axis lines=left, |
|
88 width=5cm, |
|
89 height=5cm, |
|
90 legend entries={Java, Python}, |
|
91 legend pos=north west, |
|
92 legend cell align=left] |
|
93 \addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data}; |
|
94 \addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data}; |
|
95 \end{axis} |
|
96 \end{tikzpicture} |
|
97 & |
|
98 \begin{tikzpicture} |
|
99 \begin{axis}[ |
|
100 xlabel={$n$}, |
|
101 x label style={at={(1.05,0.0)}}, |
|
102 ylabel={time in secs}, |
|
103 enlargelimits=false, |
|
104 ymax=35, |
|
105 ytick={0,5,...,30}, |
|
106 axis lines=left, |
|
107 scaled ticks=false, |
|
108 width=6.5cm, |
|
109 height=5cm] |
|
110 \addplot[green,mark=square*,mark options={fill=white}] table {re2a.data}; |
|
111 \addplot[black,mark=square*,mark options={fill=white}] table {re3a.data}; |
|
112 \end{axis} |
|
113 \end{tikzpicture} |
|
114 \end{tabular} |
|
115 \end{center}\medskip |
|
116 |
124 |
117 \noindent |
125 \noindent |
118 We will use these regular expressions and strings |
126 We will use these regular expressions and strings as running |
119 as running examples. |
127 examples. There will be several versions (V1, V2, V3,\ldots) of the |
120 |
128 algorithm.\footnote{The corresponding files are \texttt{re1.scala}, |
|
129 \texttt{re2.scala} and so on. As usual, you can find the code on |
|
130 KEATS.}\bigskip |
|
131 |
|
132 \noindent |
121 Having specified in the previous lecture what |
133 Having specified in the previous lecture what |
122 problem our regular expression matcher is supposed to solve, |
134 problem our regular expression matcher is supposed to solve, |
123 namely for any given regular expression $r$ and string $s$ |
135 namely for any given regular expression $r$ and string $s$ |
124 answer \textit{true} if and only if |
136 answer \textit{true} if and only if |
125 |
137 |
569 |
581 |
570 \begin{center} |
582 \begin{center} |
571 \code{case class NTIMES(r: Rexp, n: Int) extends Rexp} |
583 \code{case class NTIMES(r: Rexp, n: Int) extends Rexp} |
572 \end{center} |
584 \end{center} |
573 |
585 |
574 \noindent With this fix we have a constant ``size'' regular |
586 \noindent With this fix we have a constant ``size'' regular expression |
575 expression for our running example no matter how large $n$ is. |
587 for our running example no matter how large $n$ is (see the |
576 This means we have to also add cases for \pcode{NTIMES} in the |
588 \texttt{size} section in the implementations). This means we have to |
577 functions $\textit{nullable}$ and $\textit{der}$. Does the change have any |
589 also add cases for \pcode{NTIMES} in the functions $\textit{nullable}$ |
578 effect? |
590 and $\textit{der}$. Does the change have any effect? |
579 |
591 |
580 \begin{center} |
592 \begin{center} |
581 \begin{tikzpicture} |
593 \begin{tikzpicture} |
582 \begin{axis}[ |
594 \begin{axis}[ |
583 title={Graph: $a^{?\{n\}} \cdot a^{\{n\}}$ and strings $\underbrace{a\ldots a}_{n}$}, |
595 title={Graph: $a^{?\{n\}} \cdot a^{\{n\}}$ and strings $\underbrace{a\ldots a}_{n}$}, |
601 \addplot[green,mark=square*,mark options={fill=white}] table {re2.data}; |
613 \addplot[green,mark=square*,mark options={fill=white}] table {re2.data}; |
602 \end{axis} |
614 \end{axis} |
603 \end{tikzpicture} |
615 \end{tikzpicture} |
604 \end{center} |
616 \end{center} |
605 |
617 |
606 \noindent Now we are talking business! The modified matcher |
618 \noindent Now we are talking business! The modified matcher can within |
607 can within 25 seconds handle regular expressions up to |
619 25 seconds handle regular expressions up to $n = 1,100$ before a |
608 $n = 1,100$ before a StackOverflow is raised. Recall that Python and Ruby |
620 StackOverflow is raised. Recall that Python and Ruby (and our first |
609 (and our first version, Scala V1) could only handle $n = 27$ or so in 30 |
621 version, Scala V1) could only handle $n = 27$ or so in 30 |
610 seconds. There is no change for our second example |
622 seconds. There is no change for our second example $(a^*)^* \cdot |
611 $(a^*)^* \cdot b$---so this is still good. |
623 b$---so this is still good. |
612 |
624 |
613 |
625 |
614 The moral is that our algorithm is rather sensitive to the |
626 The moral is that our algorithm is rather sensitive to the |
615 size of regular expressions it needs to handle. This is of |
627 size of regular expressions it needs to handle. This is of |
616 course obvious because both $\textit{nullable}$ and $\textit{der}$ frequently |
628 course obvious because both $\textit{nullable}$ and $\textit{der}$ frequently |
617 need to traverse the whole regular expression. There seems, |
629 need to traverse the whole regular expression. There seems, |
618 however, one more issue for making the algorithm run faster. |
630 however, one more issue for making the algorithm run faster. |
619 The derivative function often produces ``useless'' |
631 The derivative function often produces ``useless'' |
620 $\ZERO$s and $\ONE$s. To see this, consider $r = ((a |
632 $\ZERO$s and $\ONE$s. To see this, consider $r = ((a |
621 \cdot b) + b)^*$ and the following two derivatives |
633 \cdot b) + b)^*$ and the following three derivatives |
622 |
634 |
623 \begin{center} |
635 \begin{center} |
624 \begin{tabular}{l} |
636 \begin{tabular}{l} |
625 $\textit{der}\,a\,r = ((\ONE \cdot b) + \ZERO) \cdot r$\\ |
637 $\textit{der}\,a\,r = ((\ONE \cdot b) + \ZERO) \cdot r$\\ |
626 $\textit{der}\,b\,r = ((\ZERO \cdot b) + \ONE)\cdot r$\\ |
638 $\textit{der}\,b\,r = ((\ZERO \cdot b) + \ONE)\cdot r$\\ |
640 $\textit{der}\,c\,r \equiv \ZERO$ |
652 $\textit{der}\,c\,r \equiv \ZERO$ |
641 \end{tabular} |
653 \end{tabular} |
642 \end{center} |
654 \end{center} |
643 |
655 |
644 \noindent I leave it to you to contemplate whether such a |
656 \noindent I leave it to you to contemplate whether such a |
645 simplification can have any impact on the correctness of our |
657 simplification can have any impact on the correctness of our algorithm |
646 algorithm (will it change any answers?). Figure~\ref{scala2} |
658 (will it change any answers?). Figure~\ref{scala2} gives a |
647 gives a simplification function that recursively traverses a |
659 simplification function that recursively traverses a regular |
648 regular expression and simplifies it according to the rules |
660 expression and simplifies it according to the rules given at the |
649 given at the beginning. There are only rules for $+$, $\cdot$ |
661 beginning. There are only rules for $+$, $\cdot$ and $n$-times (the |
650 and $n$-times (the latter because we added it in the second |
662 latter because we added it in the second version of our |
651 version of our matcher). There is no rule for a star, because |
663 matcher). There is no simplification rule for a star, because |
652 empirical data and also a little thought showed that |
664 empirical data and also a little thought showed that simplifying under |
653 simplifying under a star is a waste of computation time. The |
665 a star is a waste of computation time. The simplification function |
654 simplification function will be called after every derivation. |
666 will be called after every derivation. This additional step removes |
655 This additional step removes all the ``junk'' the derivative |
667 all the ``junk'' the derivative function introduced. Does this improve |
656 function introduced. Does this improve the speed? You bet!! |
668 the speed? You bet!! |
657 |
669 |
658 \begin{figure}[p] |
670 \begin{figure}[p] |
659 \lstinputlisting[numbers=left,linebackgroundcolor= |
671 \lstinputlisting[numbers=left,linebackgroundcolor= |
660 {\ifodd\value{lstnumber}\color{capri!3}\fi}] |
672 {\ifodd\value{lstnumber}\color{capri!3}\fi}] |
661 {../progs/app6.scala} |
673 {../progs/app6.scala} |
673 title={Graph: $a^{?\{n\}} \cdot a^{\{n\}}$ and strings $\underbrace{a\ldots a}_{n}$}, |
685 title={Graph: $a^{?\{n\}} \cdot a^{\{n\}}$ and strings $\underbrace{a\ldots a}_{n}$}, |
674 xlabel={$n$}, |
686 xlabel={$n$}, |
675 x label style={at={(1.04,0.0)}}, |
687 x label style={at={(1.04,0.0)}}, |
676 ylabel={time in secs}, |
688 ylabel={time in secs}, |
677 enlargelimits=false, |
689 enlargelimits=false, |
678 xtick={0,3000,...,9000}, |
690 xtick={0,2500,...,10000}, |
679 xmax=10000, |
691 xmax=12000, |
680 ytick={0,5,...,30}, |
692 ytick={0,5,...,30}, |
681 ymax=32, |
693 ymax=32, |
682 scaled ticks=false, |
694 scaled ticks=false, |
683 axis lines=left, |
695 axis lines=left, |
684 width=9cm, |
696 width=9cm, |
691 \end{axis} |
703 \end{axis} |
692 \end{tikzpicture} |
704 \end{tikzpicture} |
693 \end{center} |
705 \end{center} |
694 |
706 |
695 \noindent |
707 \noindent |
696 To reacap, Python and Ruby needed approximately 30 seconds to match |
708 To reacap, Python and Ruby needed approximately 30 seconds to match a |
697 a string of 28 \texttt{a}s and the regular expression $a^{?\{n\}} \cdot a^{\{n\}}$. |
709 string of 28 \texttt{a}s and the regular expression $a^{?\{n\}} \cdot |
698 We need a third of this time to do the same with strings up to 12,000 \texttt{a}s. |
710 a^{\{n\}}$. We need a third of this time to do the same with strings |
699 Similarly, Java needed 30 seconds to find out the regular expression |
711 up to 11,000 \texttt{a}s. Similarly, Java and Python needed 30 |
700 $(a^*)^* \cdot b$ does not match the string of 28 \texttt{a}s. We can do |
712 seconds to find out the regular expression $(a^*)^* \cdot b$ does not |
701 the same in approximately 5 seconds for strings of 6000000 \texttt{a}s: |
713 match the string of 28 \texttt{a}s. We can do the same in |
|
714 for strings of 6,000,000 \texttt{a}s: |
702 |
715 |
703 |
716 |
704 \begin{center} |
717 \begin{center} |
705 \begin{tikzpicture} |
718 \begin{tikzpicture} |
706 \begin{axis}[ |
719 \begin{axis}[ |
707 title={Graph: $(a^*)^* \cdot b$ and strings $\underbrace{a\ldots a}_{n}$}, |
720 title={Graph: $(a^*)^* \cdot b$ and strings $\underbrace{a\ldots a}_{n}$}, |
708 xlabel={$n$}, |
721 xlabel={$n$}, |
709 x label style={at={(1.09,0.0)}}, |
|
710 ylabel={time in secs}, |
722 ylabel={time in secs}, |
711 enlargelimits=false, |
723 enlargelimits=false, |
712 xmax=7700000, |
724 ymax=35, |
713 ytick={0,5,...,30}, |
725 ytick={0,5,...,30}, |
714 ymax=32, |
|
715 %scaled ticks=false, |
|
716 axis lines=left, |
726 axis lines=left, |
|
727 scaled ticks=false, |
|
728 x label style={at={(1.09,0.0)}}, |
|
729 %xmax=7700000, |
717 width=9cm, |
730 width=9cm, |
718 height=5cm, |
731 height=5cm, |
719 legend entries={Scala V2, Scala V3}, |
732 legend entries={Scala V3}, |
720 legend pos=outer north east, |
733 legend pos=outer north east, |
721 legend cell align=left] |
734 legend cell align=left] |
722 \addplot[green,mark=square*,mark options={fill=white}] table {re2a.data}; |
735 %\addplot[green,mark=square*,mark options={fill=white}] table {re2a.data}; |
723 \addplot[black,mark=square*,mark options={fill=white}] table {re3a.data}; |
736 \addplot[black,mark=square*,mark options={fill=white}] table {re3a.data}; |
724 \end{axis} |
737 \end{axis} |
725 \end{tikzpicture} |
738 \end{tikzpicture} |
726 \end{center} |
739 \end{center} |
727 |
740 |
728 \subsection*{Epilogue} |
741 \subsection*{Epilogue} |
729 |
742 |
730 (23/Aug/2016) I recently found another place where this algorithm can be |
743 (23/Aug/2016) I recently found another place where this algorithm can be |
731 sped (this idea is not integrated with what is coming next, |
744 sped up (this idea is not integrated with what is coming next, |
732 but I present it nonetheless). The idea is to define \texttt{ders} |
745 but I present it nonetheless). The idea is to define \texttt{ders} |
733 not such that it iterates the derivative character-by-character, but |
746 not such that it iterates the derivative character-by-character, but |
734 in bigger chunks. The resulting code for \texttt{ders2} looks as |
747 in bigger chunks. The resulting code for \texttt{ders2} looks as |
735 follows: |
748 follows: |
736 |
749 |