14 In this chapter we introduce simplifications |
14 In this chapter we introduce simplifications |
15 for annotated regular expressions that can be applied to |
15 for annotated regular expressions that can be applied to |
16 each intermediate derivative result. This allows |
16 each intermediate derivative result. This allows |
17 us to make $\blexer$ much more efficient. |
17 us to make $\blexer$ much more efficient. |
18 Sulzmann and Lu already introduced some simplifications for bitcoded regular expressions, |
18 Sulzmann and Lu already introduced some simplifications for bitcoded regular expressions, |
19 but their simplification functions were inefficient and in some cases needs fixing. |
19 but their simplification functions were inefficient and in some cases needed fixing. |
20 %We contrast our simplification function |
20 %We contrast our simplification function |
21 %with Sulzmann and Lu's, indicating the simplicity of our algorithm. |
21 %with Sulzmann and Lu's, indicating the simplicity of our algorithm. |
22 %This is another case for the usefulness |
22 %This is another case for the usefulness |
23 %and reliability of formal proofs on algorithms. |
23 %and reliability of formal proofs on algorithms. |
24 %These ``aggressive'' simplifications would not be possible in the injection-based |
24 %These ``aggressive'' simplifications would not be possible in the injection-based |
49 \begin{center} |
49 \begin{center} |
50 $((a^*a^* + a^*) + a^*)\cdot(a^*a^*)^* + (a^*a^* + a^*)\cdot(a^*a^*)^*$ |
50 $((a^*a^* + a^*) + a^*)\cdot(a^*a^*)^* + (a^*a^* + a^*)\cdot(a^*a^*)^*$ |
51 \end{center} |
51 \end{center} |
52 because the duplicates are |
52 because the duplicates are |
53 not next to each other and therefore the rule |
53 not next to each other and therefore the rule |
54 $r+ r \rightarrow r$ does not fire. |
54 $r+ r \rightarrow r$ from $\textit{simp}$ does not fire. |
55 One would expect a better simplification function to work in the |
55 One would expect a better simplification function to work in the |
56 following way: |
56 following way: |
57 \begin{gather*} |
57 \begin{gather*} |
58 ((a^*a^* + \underbrace{a^*}_\text{A})+\underbrace{a^*}_\text{duplicate of A})\cdot(a^*a^*)^* + |
58 ((a^*a^* + \underbrace{a^*}_\text{A})+\underbrace{a^*}_\text{duplicate of A})\cdot(a^*a^*)^* + |
59 \underbrace{(a^*a^* + a^*)\cdot(a^*a^*)^*}_\text{further simp removes this}.\\ |
59 \underbrace{(a^*a^* + a^*)\cdot(a^*a^*)^*}_\text{further simp removes this}.\\ |
92 $\textit{simp}\_{SL} \; _{bs}\sum []$ & $\dn$ & $\ZERO$\\ |
92 $\textit{simp}\_{SL} \; _{bs}\sum []$ & $\dn$ & $\ZERO$\\ |
93 $\textit{simp}\_{SL} \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2)$ & $\dn$ & |
93 $\textit{simp}\_{SL} \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2)$ & $\dn$ & |
94 $_{bs}\sum ((\map \; (\fuse \; bs')\; rs_1) @ rs_2)$\\ |
94 $_{bs}\sum ((\map \; (\fuse \; bs')\; rs_1) @ rs_2)$\\ |
95 $\textit{simp}\_{SL} \; _{bs}\sum[r]$ & $\dn$ & $\fuse \; bs \; (\textit{simp}\_{SL} \; r)$\\ |
95 $\textit{simp}\_{SL} \; _{bs}\sum[r]$ & $\dn$ & $\fuse \; bs \; (\textit{simp}\_{SL} \; r)$\\ |
96 $\textit{simp}\_{SL} \; _{bs}\sum(r::rs)$ & $\dn$ & $_{bs}\sum |
96 $\textit{simp}\_{SL} \; _{bs}\sum(r::rs)$ & $\dn$ & $_{bs}\sum |
97 (\nub \; (\filter \; (\not \circ \zeroable)\;((\textit{simp}\_{SL} \; r) :: \map \; \textit{simp}\_{SL} \; rs)))$\\ |
97 (\nub \; (\filter \; (\neg\zeroable)\;((\textit{simp}\_{SL} \; r) :: \map \; \textit{simp}\_{SL} \; rs)))$\\ |
98 |
98 |
99 \end{tabular} |
99 \end{tabular} |
100 \end{center} |
100 \end{center} |
101 \noindent |
101 \noindent |
102 The $\textit{zeroable}$ predicate |
102 The $\textit{zeroable}$ predicate |
204 $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s} |
204 $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s} |
205 $ using Sulzmann and Lu's lexer}\label{SulzmannLuLexerTime} |
205 $ using Sulzmann and Lu's lexer}\label{SulzmannLuLexerTime} |
206 \end{figure} |
206 \end{figure} |
207 \noindent |
207 \noindent |
208 which seems like a counterexample for |
208 which seems like a counterexample for |
209 their linear complexity claim: |
209 Sulzmann and Lu's linear complexity claim |
|
210 in their paper \cite{Sulzmann2014}: |
210 \begin{quote}\it |
211 \begin{quote}\it |
211 Linear-Time Complexity Claim \\It is easy to see that each call of one of the functions/operations: |
212 ``Linear-Time Complexity Claim \\It is easy to see that each call of one of the functions/operations: |
212 simp, fuse, mkEpsBC and isPhi leads to subcalls whose number is bound by the size of the regular expression involved. We claim that thanks to aggressively applying simp this size remains finite. Hence, we can argue that the above mentioned functions/operations have constant time complexity which implies that we can incrementally compute bit-coded parse trees in linear time in the size of the input. |
213 simp, fuse, mkEpsBC and isPhi leads to subcalls whose number is bound by the size of the regular expression involved. We claim that thanks to aggressively applying simp this size remains finite. Hence, we can argue that the above mentioned functions/operations have constant time complexity which implies that we can incrementally compute bit-coded parse trees in linear time in the size of the input.'' |
213 \end{quote} |
214 \end{quote} |
214 \noindent |
215 \noindent |
215 The assumption that the size of the regular expressions |
216 The assumption that the size of the regular expressions |
216 in the algorithm |
217 in the algorithm |
217 would stay below a finite constant is not true, not in the |
218 would stay below a finite constant is not true, at least not in the |
218 examples we considered. |
219 examples we considered. |
219 The main reason behind this is that (i) the $\textit{nub}$ |
220 The main reason behind this is that (i) Haskell's $\textit{nub}$ |
220 function requires identical annotations between two |
221 function requires identical annotations between two |
221 annotated regular expressions to qualify as duplicates, |
222 annotated regular expressions to qualify as duplicates, |
222 and therefore cannot simplify cases like $_{SZZ}a^*+_{SZS}a^*$ |
223 and therefore cannot simplify cases like $_{SZZ}a^*+_{SZS}a^*$ |
223 even if both $a^*$ denote the same language, and |
224 even if both $a^*$ denote the same language, and |
224 (ii) the ``flattening'' only applies to the head of the list |
225 (ii) the ``flattening'' only applies to the head of the list |
230 $_{bs}\sum ((\map \; (\fuse \; bs')\; rs_1) @ rs_2)$\\ |
231 $_{bs}\sum ((\map \; (\fuse \; bs')\; rs_1) @ rs_2)$\\ |
231 \end{tabular} |
232 \end{tabular} |
232 \end{center} |
233 \end{center} |
233 \noindent |
234 \noindent |
234 clause, and therefore is not strong enough to simplify all |
235 clause, and therefore is not strong enough to simplify all |
235 needed parts of the regular expression. Moreover, even if the regular expressions size |
236 needed parts of the regular expression. Moreover, |
236 do stay finite, one has to take into account that |
237 the $\textit{simp}\_{SL}$ function is applied repeatedly |
237 the $\textit{simp}\_{SL}$ function is applied many times |
238 in each derivative step until a fixed point is reached, |
238 in each derivative step, and that number is not necessarily |
239 which makes the algorithm even more |
239 a constant with respect to the size of the regular expression. |
240 unpredictable and inefficient. |
240 %To not get ``caught off guard'' by |
241 %To not get ``caught off guard'' by |
241 %these counterexamples, |
242 %these counterexamples, |
242 %one needs to be more careful when designing the |
243 %one needs to be more careful when designing the |
243 %simplification function and making claims about them. |
244 %simplification function and making claims about them. |
244 |
245 |
245 \section{Our $\textit{Simp}$ Function} |
246 \section{Our $\textit{Simp}$ Function} |
246 We will now introduce our simplification function. |
247 We will now introduce our own simplification function. |
247 %by making a contrast with $\textit{simp}\_{SL}$. |
248 %by making a contrast with $\textit{simp}\_{SL}$. |
248 We also describe |
249 We also describe |
249 the ideas behind Sulzmann and Lu's $\textit{simp}\_{SL}$ |
250 the ideas behind Sulzmann and Lu's $\textit{simp}\_{SL}$ |
250 algorithm |
251 algorithm |
251 and why it fails to achieve the desired effect. |
252 and why it fails to achieve the desired effect of keeping the sizes of derivatives finitely bounded. |
252 Our simplification function comes with a formal |
253 In addition, our simplification function will come with a formal |
253 correctness proof. |
254 correctness proof. |
254 \subsection{Flattening Nested Alternatives} |
255 \subsection{Flattening Nested Alternatives} |
255 The idea behind the clause |
256 The idea behind the clause |
256 \begin{center} |
257 \begin{center} |
257 $\textit{simp}\_{SL} \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2) \quad \dn \quad |
258 $\textit{simp}\_{SL} \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2) \quad \dn \quad |
398 \end{center} |
399 \end{center} |
399 |
400 |
400 \noindent |
401 \noindent |
401 The simplification (named $\textit{bsimp}$ for \emph{b}it-coded) |
402 The simplification (named $\textit{bsimp}$ for \emph{b}it-coded) |
402 does a pattern matching on the regular expression. |
403 does a pattern matching on the regular expression. |
403 When it detected that the regular expression is an alternative or |
404 When it detects that the regular expression is an alternative or |
404 sequence, it will try to simplify its children regular expressions |
405 sequence, it will try to simplify its children regular expressions |
405 recursively and then see if one of the children turns into $\ZERO$ or |
406 recursively and then see if one of the children turns into $\ZERO$ or |
406 $\ONE$, which might trigger further simplification at the current level. |
407 $\ONE$, which might trigger further simplification at the current level. |
407 Current level simplifications are handled by the function $\textit{bsimp}_{ASEQ}$, |
408 Current level simplifications are handled by the function $\textit{bsimp}_{ASEQ}$, |
408 using rules such as $\ZERO \cdot r \rightarrow \ZERO$ and $\ONE \cdot r \rightarrow r$. |
409 using rules such as $\ZERO \cdot r \rightarrow \ZERO$ and $\ONE \cdot r \rightarrow r$. |
457 derivatives with simplifications from our $\simp$ function |
458 derivatives with simplifications from our $\simp$ function |
458 is called $\blexersimp$: |
459 is called $\blexersimp$: |
459 \begin{center} |
460 \begin{center} |
460 \begin{tabular}{lcl} |
461 \begin{tabular}{lcl} |
461 $\textit{blexer\_simp}\;r\,s$ & $\dn$ & |
462 $\textit{blexer\_simp}\;r\,s$ & $\dn$ & |
462 $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\ |
463 $\textit{let}\;a = (r^\uparrow)\backslash_{bsimp}\, s\;\textit{in}$\\ |
463 & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\ |
464 & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\ |
464 & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\ |
465 & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\ |
465 & & $\;\;\textit{else}\;\textit{None}$ |
466 & & $\;\;\textit{else}\;\textit{None}$ |
466 \end{tabular} |
467 \end{tabular} |
467 \end{center} |
468 \end{center} |
468 |
469 \noindent |
469 \noindent |
470 This algorithm keeps the regular expression size small, |
470 This algorithm keeps the regular expression size small. |
471 as we shall demonstrate with some examples in the next section. |
471 |
472 |
472 |
473 |
473 \subsection{Examples $(a+aa)^*$ and $(a^*\cdot a^*)^*$ |
474 \subsection{Examples $(a+aa)^*$ and $(a^*\cdot a^*)^*$ |
474 After Simplification} |
475 After Simplification} |
475 Recall the |
476 Recall the |
508 \addplot[red,mark=*, mark options={fill=white}] table {BetterWaterloo.data}; |
509 \addplot[red,mark=*, mark options={fill=white}] table {BetterWaterloo.data}; |
509 \end{axis} |
510 \end{axis} |
510 \end{tikzpicture} |
511 \end{tikzpicture} |
511 \end{tabular} |
512 \end{tabular} |
512 \end{center} |
513 \end{center} |
513 \caption{Our Improvement over Sulzmann and Lu's in terms of size} |
514 \caption{Our Improvement over Sulzmann and Lu's in terms of size of the derivatives.} |
514 \end{figure} |
515 \end{figure} |
515 \noindent |
516 \noindent |
516 Given the size difference, it is not |
517 Given the size difference, it is not |
517 surprising that our $\blexersimp$ significantly outperforms |
518 surprising that our $\blexersimp$ significantly outperforms |
518 $\textit{blexer\_SLSimp}$. |
519 $\textit{blexer\_SLSimp}$ by Sulzmann and Lu. |
519 In the next section we are going to establish that our |
520 In the next section we are going to establish that our |
520 simplification preserves the correctness of the algorithm. |
521 simplification preserves the correctness of the algorithm. |
521 %---------------------------------------------------------------------------------------- |
522 %---------------------------------------------------------------------------------------- |
522 % SECTION rewrite relation |
523 % SECTION rewrite relation |
523 %---------------------------------------------------------------------------------------- |
524 %---------------------------------------------------------------------------------------- |
619 \noindent |
620 \noindent |
620 The rules $LT$ and $LH$ are for rewriting two regular expression lists |
621 The rules $LT$ and $LH$ are for rewriting two regular expression lists |
621 such that one regular expression |
622 such that one regular expression |
622 in the left-hand-side list is rewritable in one step |
623 in the left-hand-side list is rewritable in one step |
623 to the right-hand-side's regular expression at the same position. |
624 to the right-hand-side's regular expression at the same position. |
624 This helps with defining the ``context rule'' $AL$.\\ |
625 This helps with defining the ``context rule'' $AL$. |
|
626 |
625 The reflexive transitive closure of $\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$ |
627 The reflexive transitive closure of $\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$ |
626 are defined in the usual way: |
628 are defined in the usual way: |
627 \begin{figure}[H] |
629 \begin{figure}[H] |
628 \centering |
630 \centering |
629 \begin{mathpar} |
631 \begin{mathpar} |
773 \implies \bmkeps \; r = \bmkeps \; r'$} |
775 \implies \bmkeps \; r = \bmkeps \; r'$} |
774 Intuitively, this property says we can |
776 Intuitively, this property says we can |
775 extract the same bitcodes using $\bmkeps$ from the nullable |
777 extract the same bitcodes using $\bmkeps$ from the nullable |
776 components of two regular expressions $r$ and $r'$, |
778 components of two regular expressions $r$ and $r'$, |
777 if we can rewrite from one to the other in finitely |
779 if we can rewrite from one to the other in finitely |
778 many steps.\\ |
780 many steps. |
|
781 |
779 For convenience, |
782 For convenience, |
780 we define a predicate for a list of regular expressions |
783 we define a predicate for a list of regular expressions |
781 having at least one nullable regular expressions: |
784 having at least one nullable regular expressions: |
782 \begin{center} |
785 \begin{center} |
783 $\textit{bnullables} \; rs \quad \dn \quad \exists r \in rs. \;\; \bnullable \; r$ |
786 $\textit{bnullables} \; rs \quad \dn \quad \exists r \in rs. \;\; \bnullable \; r$ |
848 \subsubsection{Property 2: $r \stackrel{*}{\rightsquigarrow} \textit{bsimp} \; r$} |
851 \subsubsection{Property 2: $r \stackrel{*}{\rightsquigarrow} \textit{bsimp} \; r$} |
849 Now we get to the key part of the proof, |
852 Now we get to the key part of the proof, |
850 which says that our simplification's helper functions |
853 which says that our simplification's helper functions |
851 such as $\distinctBy$ and $\flts$ describe |
854 such as $\distinctBy$ and $\flts$ describe |
852 reducts of $\stackrel{s*}{\rightsquigarrow}$ and |
855 reducts of $\stackrel{s*}{\rightsquigarrow}$ and |
853 $\rightsquigarrow^* $.\\ |
856 $\rightsquigarrow^* $. |
|
857 |
854 The first lemma to prove is a more general version of |
858 The first lemma to prove is a more general version of |
855 $rs_ 1 \rightsquigarrow^* \distinctBy \; rs_1 \; \phi$: |
859 $rs_ 1 \rightsquigarrow^* \distinctBy \; rs_1 \; \phi$: |
856 \begin{lemma} |
860 \begin{lemma} |
857 $rs_1 @ rs_2 \stackrel{s*}{\rightsquigarrow} |
861 $rs_1 @ rs_2 \stackrel{s*}{\rightsquigarrow} |
858 (rs_1 @ (\distinctBy \; rs_2 \; \; \rerases \;\; (\map\;\; \rerases \; \; rs_1)))$ |
862 (rs_1 @ (\distinctBy \; rs_2 \; \; \rerases \;\; (\map\;\; \rerases \; \; rs_1)))$ |
970 \begin{proof} |
974 \begin{proof} |
971 We can rewrite in many steps from the original lexer's |
975 We can rewrite in many steps from the original lexer's |
972 derivative regular expressions to the |
976 derivative regular expressions to the |
973 lexer with simplification applied (by lemma \ref{bderBderssimp}): |
977 lexer with simplification applied (by lemma \ref{bderBderssimp}): |
974 \begin{center} |
978 \begin{center} |
975 $a \backslash s \stackrel{*}{\rightsquigarrow} \bderssimp{a}{s} $. |
979 $a \backslash s \rightsquigarrow^* \bderssimp{a}{s} $. |
976 \end{center} |
980 \end{center} |
977 We know that they generate the same bits, if the lexing result is a match: |
981 We know that they generate the same bits, if the lexing result is a match: |
978 \begin{center} |
982 \begin{center} |
979 $\bnullable \; (a \backslash s) |
983 $\bnullable \; (a \backslash s) |
980 \implies \bmkeps \; (a \backslash s) = \bmkeps \; (\bderssimp{a}{s})$ |
984 \implies \bmkeps \; (a \backslash s) = \bmkeps \; (\bderssimp{a}{s})$ |
996 \begin{center} |
1000 \begin{center} |
997 $(r, s) \rightarrow v \;\; \textit{iff}\;\; \blexer \; r \; s = \Some \;v$\\ |
1001 $(r, s) \rightarrow v \;\; \textit{iff}\;\; \blexer \; r \; s = \Some \;v$\\ |
998 $\nexists v. \; (r, s) \rightarrow v \;\; \textit{iff} \;\; \blexer\; |
1002 $\nexists v. \; (r, s) \rightarrow v \;\; \textit{iff} \;\; \blexer\; |
999 r\;s = \None$. |
1003 r\;s = \None$. |
1000 \end{center} |
1004 \end{center} |
1001 and obtain the corollary that the bit-coded lexer with simplification is |
1005 and obtain the property that the bit-coded lexer with simplification is |
1002 indeed correctly outputting POSIX lexing result, if such a result exists. |
1006 indeed correctly generating a POSIX lexing result, if such a result exists. |
1003 \begin{corollary} |
1007 \begin{corollary} |
1004 $(r, s) \rightarrow v \;\; \textit{iff} \;\; \blexersimp \; r\; s = \Some \; v$\\ |
1008 $(r, s) \rightarrow v \;\; \textit{iff} \;\; \blexersimp \; r\; s = \Some \; v$\\ |
1005 $\nexists v. \; (r, s) \rightarrow v \;\; \textit{iff} \;\; \blexersimp\; |
1009 $\nexists v. \; (r, s) \rightarrow v \;\; \textit{iff} \;\; \blexersimp\; |
1006 r\;s = \None$. |
1010 r\;s = \None$. |
1007 \end{corollary} |
1011 \end{corollary} |
1013 in \cref{flex_retrieve}. |
1017 in \cref{flex_retrieve}. |
1014 The problem is that both functions $\inj$ and $\retrieve$ require |
1018 The problem is that both functions $\inj$ and $\retrieve$ require |
1015 that the annotated regular expressions stay unsimplified, |
1019 that the annotated regular expressions stay unsimplified, |
1016 so that one can |
1020 so that one can |
1017 correctly compare $v_{i+1}$ and $r_i$ and $v_i$ |
1021 correctly compare $v_{i+1}$ and $r_i$ and $v_i$ |
1018 in diagram \ref{graph:inj} and |
1022 in diagram \ref{graph:inj}. |
1019 ``fit the key into the lock hole''. |
1023 |
1020 |
|
1021 \noindent |
|
1022 We also tried to prove |
1024 We also tried to prove |
1023 \begin{center} |
1025 \begin{center} |
1024 $\textit{bsimp} \;\; (\bderssimp{a}{s}) = |
1026 $\textit{bsimp} \;\; (\bderssimp{a}{s}) = |
1025 \textit{bsimp} \;\; (a\backslash s)$, |
1027 \textit{bsimp} \;\; (a\backslash s)$, |
1026 \end{center} |
1028 \end{center} |
1027 but this turns out to be not true. |
1029 but this turns out to be not true. |
1028 A counterexample would be |
1030 A counterexample is |
1029 \[ a = [(_{Z}1+_{S}c)\cdot [bb \cdot (_{Z}1+_{S}c)]] \;\; |
1031 \[ a = [(_{Z}1+_{S}c)\cdot [bb \cdot (_{Z}1+_{S}c)]] \;\; |
1030 \text{and} \;\; s = bb. |
1032 \text{and} \;\; s = bb. |
1031 \] |
1033 \] |
1032 \noindent |
1034 \noindent |
1033 Then we would have |
1035 Then we would have |
1044 Unfortunately, |
1046 Unfortunately, |
1045 if we apply $\textit{bsimp}$ differently |
1047 if we apply $\textit{bsimp}$ differently |
1046 we will always have this discrepancy. |
1048 we will always have this discrepancy. |
1047 This is due to |
1049 This is due to |
1048 the $\map \; (\fuse\; bs) \; as$ operation |
1050 the $\map \; (\fuse\; bs) \; as$ operation |
1049 happening at different locations in the regular expression.\\ |
1051 happening at different locations in the regular expression. |
|
1052 |
1050 The rewriting relation |
1053 The rewriting relation |
1051 $\rightsquigarrow^*$ |
1054 $\rightsquigarrow^*$ |
1052 allows us to ignore this discrepancy |
1055 allows us to ignore this discrepancy |
1053 and view the expressions |
1056 and view the expressions |
1054 \begin{center} |
1057 \begin{center} |
1056 and\\ |
1059 and\\ |
1057 $_{Z}(_{Z} \ONE + _{S} c)$ |
1060 $_{Z}(_{Z} \ONE + _{S} c)$ |
1058 |
1061 |
1059 \end{center} |
1062 \end{center} |
1060 as equal, because they were both re-written |
1063 as equal, because they were both re-written |
1061 from the same expression.\\ |
1064 from the same expression. |
|
1065 |
1062 The simplification rewriting rules |
1066 The simplification rewriting rules |
1063 given in \ref{rrewriteRules} are by no means |
1067 given in \ref{rrewriteRules} are by no means |
1064 final, |
1068 final, |
1065 one could come up new rules |
1069 one could come up new rules |
1066 such as |
1070 such as |