14 In this chapter we introduce simplifications |
14 In this chapter we introduce simplifications |
15 for annotated regular expressions that can be applied to |
15 for annotated regular expressions that can be applied to |
16 each intermediate derivative result. This allows |
16 each intermediate derivative result. This allows |
17 us to make $\blexer$ much more efficient. |
17 us to make $\blexer$ much more efficient. |
18 Sulzmann and Lu already introduced some simplifications for bitcoded regular expressions, |
18 Sulzmann and Lu already introduced some simplifications for bitcoded regular expressions, |
19 but their simplification functions were inefficient and in some cases needed fixing. |
19 but their simplification functions could have been more efficient and in some cases needed fixing. |
20 %We contrast our simplification function |
20 %We contrast our simplification function |
21 %with Sulzmann and Lu's, indicating the simplicity of our algorithm. |
21 %with Sulzmann and Lu's, indicating the simplicity of our algorithm. |
22 %This is another case for the usefulness |
22 %This is another case for the usefulness |
23 %and reliability of formal proofs on algorithms. |
23 %and reliability of formal proofs on algorithms. |
24 %These ``aggressive'' simplifications would not be possible in the injection-based |
24 %These ``aggressive'' simplifications would not be possible in the injection-based |
40 & $\stackrel{\backslash a}{ |
40 & $\stackrel{\backslash a}{ |
41 \longrightarrow} $ & $\ldots$\\ |
41 \longrightarrow} $ & $\ldots$\\ |
42 \end{tabular} |
42 \end{tabular} |
43 \end{center} |
43 \end{center} |
44 \noindent |
44 \noindent |
45 As can be seen, there are serveral duplications. |
45 As can be seen, there are several duplications. |
46 A simple-minded simplification function cannot simplify |
46 A simple-minded simplification function cannot simplify |
47 the third regular expression in the above chain of derivative |
47 the third regular expression in the above chain of derivative |
48 regular expressions, namely |
48 regular expressions, namely |
49 \begin{center} |
49 \begin{center} |
50 $((a^*a^* + a^*) + a^*)\cdot(a^*a^*)^* + (a^*a^* + a^*)\cdot(a^*a^*)^*$ |
50 $((a^*a^* + a^*) + a^*)\cdot(a^*a^*)^* + (a^*a^* + a^*)\cdot(a^*a^*)^*$ |
51 \end{center} |
51 \end{center} |
52 because the duplicates are |
52 because the duplicates are |
53 not next to each other and therefore the rule |
53 not next to each other, and therefore the rule |
54 $r+ r \rightarrow r$ from $\textit{simp}$ does not fire. |
54 $r+ r \rightarrow r$ from $\textit{simp}$ does not fire. |
55 One would expect a better simplification function to work in the |
55 One would expect a better simplification function to work in the |
56 following way: |
56 following way: |
57 \begin{gather*} |
57 \begin{gather*} |
58 ((a^*a^* + \underbrace{a^*}_\text{A})+\underbrace{a^*}_\text{duplicate of A})\cdot(a^*a^*)^* + |
58 ((a^*a^* + \underbrace{a^*}_\text{A})+\underbrace{a^*}_\text{duplicate of A})\cdot(a^*a^*)^* + |
180 \caption{Lexing the regular expression $(a^*a^*)^*$ against strings of the form |
180 \caption{Lexing the regular expression $(a^*a^*)^*$ against strings of the form |
181 $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s} |
181 $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s} |
182 $ using Sulzmann and Lu's lexer}\label{SulzmannLuLexer} |
182 $ using Sulzmann and Lu's lexer}\label{SulzmannLuLexer} |
183 \end{figure} |
183 \end{figure} |
184 \noindent |
184 \noindent |
185 At $n= 20$ we already get an out of memory error with Scala's normal |
185 At $n= 20$ we already get an out-of-memory error with Scala's normal |
186 JVM heap size settings. |
186 JVM heap size settings. |
187 In fact their simplification does not improve much over |
187 In fact their simplification does not improve much over |
188 the simple-minded simplifications we have shown in \ref{fig:BetterWaterloo}. |
188 the simple-minded simplifications we have shown in \ref{fig:BetterWaterloo}. |
189 The time required also grows exponentially: |
189 The time required also grows exponentially: |
190 \begin{figure}[H] |
190 \begin{figure}[H] |
272 It is more desirable |
272 It is more desirable |
273 to flatten |
273 to flatten |
274 an entire list to open up possibilities for further simplifications |
274 an entire list to open up possibilities for further simplifications |
275 with later regular expressions. |
275 with later regular expressions. |
276 Not flattening the rest of the elements also means that |
276 Not flattening the rest of the elements also means that |
277 the later de-duplication processs |
277 the later de-duplication process |
278 does not fully remove further duplicates. |
278 does not fully remove further duplicates. |
279 For example, |
279 For example, |
280 using $\textit{simp}\_{SL}$ we cannot |
280 using $\textit{simp}\_{SL}$ we cannot |
281 simplify |
281 simplify |
282 \begin{center} |
282 \begin{center} |
416 &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow _{bs}a_1' \cdot a_2'$ |
416 &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow _{bs}a_1' \cdot a_2'$ |
417 \end{tabular} |
417 \end{tabular} |
418 \end{center} |
418 \end{center} |
419 \noindent |
419 \noindent |
420 The most involved part is the $\sum$ clause, where we first call $\flts$ on |
420 The most involved part is the $\sum$ clause, where we first call $\flts$ on |
421 the simplified children regular expression list $\textit{map}\; \textit{bsimp}\; \textit{as}$. |
421 the simplified children regular expression list $\textit{map}\; \textit{bsimp}\; \textit{as}$, |
422 and then call $\distinctBy$ on that list, the predicate determining whether two |
422 and then call $\distinctBy$ on that list. The predicate used in $\distinctBy$ for determining whether two |
423 elements are the same is $\rerases \; r_1 = \rerases\; r_2$. |
423 elements are the same is $\rerases \; r_1 = \rerases\; r_2$. |
424 Finally, depending on whether the regular expression list $as'$ has turned into a |
424 Finally, depending on whether the regular expression list $as'$ has turned into a |
425 singleton or empty list after $\flts$ and $\distinctBy$, $\textit{bsimp}_{ALTS}$ |
425 singleton or empty list after $\flts$ and $\distinctBy$, $\textit{bsimp}_{ALTS}$ |
426 decides whether to keep the current level constructor $\sum$ as it is, and |
426 decides whether to keep the current level constructor $\sum$ as it is, and |
427 removes it when there are less than two elements: |
427 removes it when there are fewer than two elements: |
428 \begin{center} |
428 \begin{center} |
429 \begin{tabular}{lcl} |
429 \begin{tabular}{lcl} |
430 $\textit{bsimp}_{ALTS} \; bs \; as'$ & $ \dn$ & $ as' \; \textit{match}$\\ |
430 $\textit{bsimp}_{ALTS} \; bs \; as'$ & $ \dn$ & $ as' \; \textit{match}$\\ |
431 &&$\quad\textit{case} \; [] \Rightarrow \ZERO$ \\ |
431 &&$\quad\textit{case} \; [] \Rightarrow \ZERO$ \\ |
432 &&$\quad\textit{case} \; a :: [] \Rightarrow \textit{fuse bs a}$ \\ |
432 &&$\quad\textit{case} \; a :: [] \Rightarrow \textit{fuse bs a}$ \\ |
619 \end{figure} |
619 \end{figure} |
620 \noindent |
620 \noindent |
621 The rules $LT$ and $LH$ are for rewriting two regular expression lists |
621 The rules $LT$ and $LH$ are for rewriting two regular expression lists |
622 such that one regular expression |
622 such that one regular expression |
623 in the left-hand-side list is rewritable in one step |
623 in the left-hand-side list is rewritable in one step |
624 to the right-hand-side's regular expression at the same position. |
624 to the right-hand side's regular expression at the same position. |
625 This helps with defining the ``context rule'' $AL$. |
625 This helps with defining the ``context rule'' $AL$. |
626 |
626 |
627 The reflexive transitive closure of $\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$ |
627 The reflexive transitive closure of $\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$ |
628 are defined in the usual way: |
628 are defined in the usual way: |
629 \begin{figure}[H] |
629 \begin{figure}[H] |
675 $rs_1 \stackrel{s*}{\rightsquigarrow} rs_2 \implies _{bs} \sum rs_1 \stackrel{*}{\rightsquigarrow} _{bs} \sum rs_2$ |
675 $rs_1 \stackrel{s*}{\rightsquigarrow} rs_2 \implies _{bs} \sum rs_1 \stackrel{*}{\rightsquigarrow} _{bs} \sum rs_2$ |
676 \item |
676 \item |
677 $r \rightsquigarrow^* r' \implies _{bs} \sum (r :: rs)\; \rightsquigarrow^*\; _{bs} \sum (r' :: rs)$ |
677 $r \rightsquigarrow^* r' \implies _{bs} \sum (r :: rs)\; \rightsquigarrow^*\; _{bs} \sum (r' :: rs)$ |
678 |
678 |
679 \item |
679 \item |
680 The rewriting in many steps property is composible |
680 The rewriting in many steps property is composable |
681 in terms of the sequence constructor:\\ |
681 in terms of the sequence constructor:\\ |
682 $r_1 \rightsquigarrow^* r_2 |
682 $r_1 \rightsquigarrow^* r_2 |
683 \implies _{bs} r_1 \cdot r_3 \rightsquigarrow^* \; |
683 \implies _{bs} r_1 \cdot r_3 \rightsquigarrow^* \; |
684 _{bs} r_2 \cdot r_3 \quad $ |
684 _{bs} r_2 \cdot r_3 \quad $ |
685 and |
685 and |
702 the inductive cases of $\stackrel{s*}{\rightsquigarrow}$ and $\rightsquigarrow^*$ respectively. |
702 the inductive cases of $\stackrel{s*}{\rightsquigarrow}$ and $\rightsquigarrow^*$ respectively. |
703 The third and fourth points are |
703 The third and fourth points are |
704 by the properties $r_1 \rightsquigarrow r_2 \implies \fuse \; bs \; r_1 \implies \fuse \; bs \; r_2$ and |
704 by the properties $r_1 \rightsquigarrow r_2 \implies \fuse \; bs \; r_1 \implies \fuse \; bs \; r_2$ and |
705 $rs_2 \stackrel{s}{\rightsquigarrow} rs_3 |
705 $rs_2 \stackrel{s}{\rightsquigarrow} rs_3 |
706 \implies \map \; (\fuse \; bs) rs_2 \stackrel{s*}{\rightsquigarrow} \map \; (\fuse \; bs)\; rs_3$, |
706 \implies \map \; (\fuse \; bs) rs_2 \stackrel{s*}{\rightsquigarrow} \map \; (\fuse \; bs)\; rs_3$, |
707 which can be indutively proven by the inductive cases of $\rightsquigarrow$ and |
707 which can be inductively proven by the inductive cases of $\rightsquigarrow$ and |
708 $\stackrel{s}{\rightsquigarrow}$. |
708 $\stackrel{s}{\rightsquigarrow}$. |
709 \end{proof} |
709 \end{proof} |
710 \noindent |
710 \noindent |
711 The inference rules of $\stackrel{s}{\rightsquigarrow}$ |
711 The inference rules of $\stackrel{s}{\rightsquigarrow}$ |
712 are defined in terms of the list cons operation, where |
712 are defined in terms of the list cons operation, where |
779 if we can rewrite from one to the other in finitely |
779 if we can rewrite from one to the other in finitely |
780 many steps. |
780 many steps. |
781 |
781 |
782 For convenience, |
782 For convenience, |
783 we define a predicate for a list of regular expressions |
783 we define a predicate for a list of regular expressions |
784 having at least one nullable regular expressions: |
784 having at least one nullable regular expression: |
785 \begin{center} |
785 \begin{center} |
786 $\textit{bnullables} \; rs \quad \dn \quad \exists r \in rs. \;\; \bnullable \; r$ |
786 $\textit{bnullables} \; rs \quad \dn \quad \exists r \in rs. \;\; \bnullable \; r$ |
787 \end{center} |
787 \end{center} |
788 \noindent |
788 \noindent |
789 The rewriting relation $\rightsquigarrow$ preserves (b)nullability: |
789 The rewriting relation $\rightsquigarrow$ preserves (b)nullability: |
801 \implies \bnullable \; r_1 = \bnullable \; r_2$ |
801 \implies \bnullable \; r_1 = \bnullable \; r_2$ |
802 \end{itemize} |
802 \end{itemize} |
803 \end{lemma} |
803 \end{lemma} |
804 \begin{proof} |
804 \begin{proof} |
805 By rule induction of $\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$. |
805 By rule induction of $\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$. |
806 The third point is a corollary of the second. |
806 The third point is a result of the second. |
807 \end{proof} |
807 \end{proof} |
808 \noindent |
808 \noindent |
809 For convenience again, |
809 For convenience again, |
810 we define $\bmkepss$ on a list $rs$, |
810 we define $\bmkepss$ on a list $rs$, |
811 which extracts the bit-codes on the first $\bnullable$ element in $rs$: |
811 which extracts the bit-codes on the first $\bnullable$ element in $rs$: |
860 \begin{lemma} |
860 \begin{lemma} |
861 $rs_1 @ rs_2 \stackrel{s*}{\rightsquigarrow} |
861 $rs_1 @ rs_2 \stackrel{s*}{\rightsquigarrow} |
862 (rs_1 @ (\distinctBy \; rs_2 \; \; \rerases \;\; (\map\;\; \rerases \; \; rs_1)))$ |
862 (rs_1 @ (\distinctBy \; rs_2 \; \; \rerases \;\; (\map\;\; \rerases \; \; rs_1)))$ |
863 \end{lemma} |
863 \end{lemma} |
864 \noindent |
864 \noindent |
865 It says that that for a list made of two parts $rs_1 @ rs_2$, |
865 It says that for a list made of two parts $rs_1 @ rs_2$, |
866 one can throw away the duplicate |
866 one can throw away the duplicate |
867 elements in $rs_2$, as well as those that have appeared in $rs_1$. |
867 elements in $rs_2$, as well as those that have appeared in $rs_1$. |
868 \begin{proof} |
868 \begin{proof} |
869 By induction on $rs_2$, where $rs_1$ is allowed to be arbitrary. |
869 By induction on $rs_2$, where $rs_1$ is allowed to be arbitrary. |
870 \end{proof} |
870 \end{proof} |
944 \begin{proof} |
944 \begin{proof} |
945 By induction on $\rightsquigarrow$ |
945 By induction on $\rightsquigarrow$ |
946 and $\stackrel{s}{\rightsquigarrow}$, using a number of the previous lemmas. |
946 and $\stackrel{s}{\rightsquigarrow}$, using a number of the previous lemmas. |
947 \end{proof} |
947 \end{proof} |
948 \noindent |
948 \noindent |
949 Now we can prove property 3, as an immediate corollary: |
949 Now we can prove property 3 as an immediate corollary: |
950 \begin{corollary}\label{rewritesBder} |
950 \begin{corollary}\label{rewritesBder} |
951 $r_1 \rightsquigarrow^* r_2 \implies r_1 \backslash c \rightsquigarrow^* |
951 $r_1 \rightsquigarrow^* r_2 \implies r_1 \backslash c \rightsquigarrow^* |
952 r_2 \backslash c$ |
952 r_2 \backslash c$ |
953 \end{corollary} |
953 \end{corollary} |
954 \begin{proof} |
954 \begin{proof} |
981 We know that they generate the same bits, if the lexing result is a match: |
981 We know that they generate the same bits, if the lexing result is a match: |
982 \begin{center} |
982 \begin{center} |
983 $\bnullable \; (a \backslash s) |
983 $\bnullable \; (a \backslash s) |
984 \implies \bmkeps \; (a \backslash s) = \bmkeps \; (\bderssimp{a}{s})$ |
984 \implies \bmkeps \; (a \backslash s) = \bmkeps \; (\bderssimp{a}{s})$ |
985 \end{center} |
985 \end{center} |
986 Now that they generate the same bits, we know that they give the same value after decoding. |
986 Now that they generate the same bits, we know they also give the same value after decoding. |
987 \begin{center} |
987 \begin{center} |
988 $\bnullable \; (a \backslash s) |
988 $\bnullable \; (a \backslash s) |
989 \implies \decode \; r \; (\bmkeps \; (a \backslash s)) = |
989 \implies \decode \; r \; (\bmkeps \; (a \backslash s)) = |
990 \decode \; r \; (\bmkeps \; (\bderssimp{a}{s}))$ |
990 \decode \; r \; (\bmkeps \; (\bderssimp{a}{s}))$ |
991 \end{center} |
991 \end{center} |
1009 $\nexists v. \; (r, s) \rightarrow v \;\; \textit{iff} \;\; \blexersimp\; |
1009 $\nexists v. \; (r, s) \rightarrow v \;\; \textit{iff} \;\; \blexersimp\; |
1010 r\;s = \None$. |
1010 r\;s = \None$. |
1011 \end{corollary} |
1011 \end{corollary} |
1012 |
1012 |
1013 \subsection{Comments on the Proof Techniques Used} |
1013 \subsection{Comments on the Proof Techniques Used} |
1014 Straightforward and simple as the proof may seem, |
1014 Straightforward as the proof may seem, |
1015 the efforts we spent obtaining it were far from trivial. |
1015 the efforts we spent obtaining it were far from trivial. |
1016 We initially attempted to re-use the argument |
1016 We initially attempted to re-use the argument |
1017 in \cref{flex_retrieve}. |
1017 in \cref{flex_retrieve}. |
1018 The problem is that both functions $\inj$ and $\retrieve$ require |
1018 The problem is that both functions $\inj$ and $\retrieve$ require |
1019 that the annotated regular expressions stay unsimplified, |
1019 that the annotated regular expressions stay unsimplified, |
1058 $_{[]}(_{ZZ}\ONE + _{ZS}c ) $\\ |
1058 $_{[]}(_{ZZ}\ONE + _{ZS}c ) $\\ |
1059 and\\ |
1059 and\\ |
1060 $_{Z}(_{Z} \ONE + _{S} c)$ |
1060 $_{Z}(_{Z} \ONE + _{S} c)$ |
1061 |
1061 |
1062 \end{center} |
1062 \end{center} |
1063 as equal, because they were both re-written |
1063 as equal because they were both re-written |
1064 from the same expression. |
1064 from the same expression. |
1065 |
1065 |
1066 The simplification rewriting rules |
1066 The simplification rewriting rules |
1067 given in \ref{rrewriteRules} are by no means |
1067 given in \ref{rrewriteRules} are by no means |
1068 final, |
1068 final, |
1069 one could come up new rules |
1069 one could come up with new rules |
1070 such as |
1070 such as |
1071 $\SEQ r_1 \cdot (\SEQ r_1 \cdot r_3) \rightarrow |
1071 $\SEQ r_1 \cdot (\SEQ r_1 \cdot r_3) \rightarrow |
1072 \SEQs [r_1, r_2, r_3]$. |
1072 \SEQs [r_1, r_2, r_3]$. |
1073 However this does not fit with the proof technique |
1073 However this does not fit with the proof technique |
1074 of our main theorem, but seem to not violate the POSIX |
1074 of our main theorem, but seem to not violate the POSIX |