ChengsongTanPhdThesis/Chapters/Bitcoded2.tex
changeset 640 bd1354127574
parent 639 80cc6dc4c98b
child 649 ef2b8abcbc55
equal deleted inserted replaced
639:80cc6dc4c98b 640:bd1354127574
    14 In this chapter we introduce simplifications
    14 In this chapter we introduce simplifications
    15 for annotated regular expressions that can be applied to 
    15 for annotated regular expressions that can be applied to 
    16 each intermediate derivative result. This allows
    16 each intermediate derivative result. This allows
    17 us to make $\blexer$ much more efficient.
    17 us to make $\blexer$ much more efficient.
    18 Sulzmann and Lu already introduced some simplifications for bitcoded regular expressions,
    18 Sulzmann and Lu already introduced some simplifications for bitcoded regular expressions,
    19 but their simplification functions  were inefficient and in some cases needed fixing.
    19 but their simplification functions could have been more efficient and in some cases needed fixing.
    20 %We contrast our simplification function 
    20 %We contrast our simplification function 
    21 %with Sulzmann and Lu's, indicating the simplicity of our algorithm.
    21 %with Sulzmann and Lu's, indicating the simplicity of our algorithm.
    22 %This is another case for the usefulness 
    22 %This is another case for the usefulness 
    23 %and reliability of formal proofs on algorithms.
    23 %and reliability of formal proofs on algorithms.
    24 %These ``aggressive'' simplifications would not be possible in the injection-based 
    24 %These ``aggressive'' simplifications would not be possible in the injection-based 
    40 							     & $\stackrel{\backslash a}{
    40 							     & $\stackrel{\backslash a}{
    41 	\longrightarrow} $ & $\ldots$\\
    41 	\longrightarrow} $ & $\ldots$\\
    42 	\end{tabular}
    42 	\end{tabular}
    43 \end{center}
    43 \end{center}
    44 \noindent
    44 \noindent
    45 As can be seen, there are serveral duplications.
    45 As can be seen, there are several duplications.
    46 A simple-minded simplification function cannot simplify
    46 A simple-minded simplification function cannot simplify
    47 the third regular expression in the above chain of derivative
    47 the third regular expression in the above chain of derivative
    48 regular expressions, namely
    48 regular expressions, namely
    49 \begin{center}
    49 \begin{center}
    50 $((a^*a^* + a^*) + a^*)\cdot(a^*a^*)^* + (a^*a^* + a^*)\cdot(a^*a^*)^*$
    50 $((a^*a^* + a^*) + a^*)\cdot(a^*a^*)^* + (a^*a^* + a^*)\cdot(a^*a^*)^*$
    51 \end{center}
    51 \end{center}
    52 because the duplicates are
    52 because the duplicates are
    53 not next to each other and therefore the rule
    53 not next to each other, and therefore the rule
    54 $r+ r \rightarrow r$ from $\textit{simp}$ does not fire.
    54 $r+ r \rightarrow r$ from $\textit{simp}$ does not fire.
    55 One would expect a better simplification function to work in the 
    55 One would expect a better simplification function to work in the 
    56 following way:
    56 following way:
    57 \begin{gather*}
    57 \begin{gather*}
    58 	((a^*a^* + \underbrace{a^*}_\text{A})+\underbrace{a^*}_\text{duplicate of A})\cdot(a^*a^*)^* + 
    58 	((a^*a^* + \underbrace{a^*}_\text{A})+\underbrace{a^*}_\text{duplicate of A})\cdot(a^*a^*)^* + 
   180 \caption{Lexing the regular expression $(a^*a^*)^*$ against strings of the form
   180 \caption{Lexing the regular expression $(a^*a^*)^*$ against strings of the form
   181 $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}
   181 $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}
   182 $ using Sulzmann and Lu's lexer}\label{SulzmannLuLexer}
   182 $ using Sulzmann and Lu's lexer}\label{SulzmannLuLexer}
   183 \end{figure}
   183 \end{figure}
   184 \noindent
   184 \noindent
   185 At $n= 20$ we already get an out of memory error with Scala's normal 
   185 At $n= 20$ we already get an out-of-memory error with Scala's normal 
   186 JVM heap size settings.
   186 JVM heap size settings.
   187 In fact their simplification does not improve much over
   187 In fact their simplification does not improve much over
   188 the simple-minded simplifications we have shown in \ref{fig:BetterWaterloo}.
   188 the simple-minded simplifications we have shown in \ref{fig:BetterWaterloo}.
   189 The time required also grows exponentially:
   189 The time required also grows exponentially:
   190 \begin{figure}[H]
   190 \begin{figure}[H]
   272 It is more desirable
   272 It is more desirable
   273 to flatten
   273 to flatten
   274 an entire list to open up possibilities for further simplifications
   274 an entire list to open up possibilities for further simplifications
   275 with later regular expressions.
   275 with later regular expressions.
   276 Not flattening the rest of the elements also means that
   276 Not flattening the rest of the elements also means that
   277 the later de-duplication processs 
   277 the later de-duplication process 
   278 does not fully remove further duplicates.
   278 does not fully remove further duplicates.
   279 For example,
   279 For example,
   280 using $\textit{simp}\_{SL}$ we cannot
   280 using $\textit{simp}\_{SL}$ we cannot
   281 simplify
   281 simplify
   282 \begin{center}
   282 \begin{center}
   416    &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow   _{bs}a_1' \cdot a_2'$ 
   416    &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow   _{bs}a_1' \cdot a_2'$ 
   417 	\end{tabular}
   417 	\end{tabular}
   418 \end{center}
   418 \end{center}
   419 \noindent
   419 \noindent
   420 The most involved part is the $\sum$ clause, where we first call $\flts$ on
   420 The most involved part is the $\sum$ clause, where we first call $\flts$ on
   421 the simplified children regular expression list $\textit{map}\; \textit{bsimp}\; \textit{as}$.
   421 the simplified children regular expression list $\textit{map}\; \textit{bsimp}\; \textit{as}$,
   422 and then call $\distinctBy$ on that list, the predicate determining whether two 
   422 and then call $\distinctBy$ on that list. The predicate used in $\distinctBy$ for determining whether two 
   423 elements are the same is $\rerases \; r_1 = \rerases\; r_2$.
   423 elements are the same is $\rerases \; r_1 = \rerases\; r_2$.
   424 Finally, depending on whether the regular expression list $as'$ has turned into a
   424 Finally, depending on whether the regular expression list $as'$ has turned into a
   425 singleton or empty list after $\flts$ and $\distinctBy$, $\textit{bsimp}_{ALTS}$
   425 singleton or empty list after $\flts$ and $\distinctBy$, $\textit{bsimp}_{ALTS}$
   426 decides whether to keep the current level constructor $\sum$ as it is, and 
   426 decides whether to keep the current level constructor $\sum$ as it is, and 
   427 removes it when there are less than two elements:
   427 removes it when there are fewer than two elements:
   428 \begin{center}
   428 \begin{center}
   429 	\begin{tabular}{lcl}
   429 	\begin{tabular}{lcl}
   430 		$\textit{bsimp}_{ALTS} \; bs \; as'$ & $ \dn$ & $ as' \; \textit{match}$\\		
   430 		$\textit{bsimp}_{ALTS} \; bs \; as'$ & $ \dn$ & $ as' \; \textit{match}$\\		
   431   &&$\quad\textit{case} \; [] \Rightarrow  \ZERO$ \\
   431   &&$\quad\textit{case} \; [] \Rightarrow  \ZERO$ \\
   432    &&$\quad\textit{case} \; a :: [] \Rightarrow  \textit{fuse bs a}$ \\
   432    &&$\quad\textit{case} \; a :: [] \Rightarrow  \textit{fuse bs a}$ \\
   619 \end{figure}
   619 \end{figure}
   620 \noindent
   620 \noindent
   621 The rules $LT$ and $LH$ are for rewriting two regular expression lists
   621 The rules $LT$ and $LH$ are for rewriting two regular expression lists
   622 such that one regular expression
   622 such that one regular expression
   623 in the left-hand-side list is rewritable in one step
   623 in the left-hand-side list is rewritable in one step
   624 to the right-hand-side's regular expression at the same position.
   624 to the right-hand side's regular expression at the same position.
   625 This helps with defining the ``context rule'' $AL$.
   625 This helps with defining the ``context rule'' $AL$.
   626 
   626 
   627 The reflexive transitive closure of $\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$
   627 The reflexive transitive closure of $\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$
   628 are defined in the usual way:
   628 are defined in the usual way:
   629 \begin{figure}[H]
   629 \begin{figure}[H]
   675 			$rs_1 \stackrel{s*}{\rightsquigarrow} rs_2 \implies _{bs} \sum rs_1 \stackrel{*}{\rightsquigarrow} _{bs} \sum rs_2$
   675 			$rs_1 \stackrel{s*}{\rightsquigarrow} rs_2 \implies _{bs} \sum rs_1 \stackrel{*}{\rightsquigarrow} _{bs} \sum rs_2$
   676 		\item
   676 		\item
   677 			$r \rightsquigarrow^* r' \implies _{bs} \sum (r :: rs)\; \rightsquigarrow^*\;  _{bs} \sum (r' :: rs)$
   677 			$r \rightsquigarrow^* r' \implies _{bs} \sum (r :: rs)\; \rightsquigarrow^*\;  _{bs} \sum (r' :: rs)$
   678 
   678 
   679 		\item
   679 		\item
   680 			The rewriting in many steps property is composible 
   680 			The rewriting in many steps property is composable 
   681 			in terms of the sequence constructor:\\
   681 			in terms of the sequence constructor:\\
   682 			$r_1 \rightsquigarrow^* r_2 
   682 			$r_1 \rightsquigarrow^* r_2 
   683 			\implies _{bs} r_1 \cdot r_3 \rightsquigarrow^* \;  
   683 			\implies _{bs} r_1 \cdot r_3 \rightsquigarrow^* \;  
   684 			_{bs} r_2 \cdot r_3 \quad $ 
   684 			_{bs} r_2 \cdot r_3 \quad $ 
   685 			and 
   685 			and 
   702 	the inductive cases of $\stackrel{s*}{\rightsquigarrow}$ and $\rightsquigarrow^*$ respectively.
   702 	the inductive cases of $\stackrel{s*}{\rightsquigarrow}$ and $\rightsquigarrow^*$ respectively.
   703 	The third and fourth points are 
   703 	The third and fourth points are 
   704 	by the properties $r_1 \rightsquigarrow r_2 \implies \fuse \; bs \; r_1 \implies \fuse \; bs \; r_2$ and
   704 	by the properties $r_1 \rightsquigarrow r_2 \implies \fuse \; bs \; r_1 \implies \fuse \; bs \; r_2$ and
   705 	$rs_2 \stackrel{s}{\rightsquigarrow} rs_3 
   705 	$rs_2 \stackrel{s}{\rightsquigarrow} rs_3 
   706 	\implies \map \; (\fuse \; bs) rs_2 \stackrel{s*}{\rightsquigarrow} \map \; (\fuse \; bs)\; rs_3$,
   706 	\implies \map \; (\fuse \; bs) rs_2 \stackrel{s*}{\rightsquigarrow} \map \; (\fuse \; bs)\; rs_3$,
   707 	which can be indutively proven by the inductive cases of $\rightsquigarrow$ and 
   707 	which can be inductively proven by the inductive cases of $\rightsquigarrow$ and 
   708 	$\stackrel{s}{\rightsquigarrow}$.
   708 	$\stackrel{s}{\rightsquigarrow}$.
   709 \end{proof}
   709 \end{proof}
   710 \noindent
   710 \noindent
   711 The inference rules of $\stackrel{s}{\rightsquigarrow}$
   711 The inference rules of $\stackrel{s}{\rightsquigarrow}$
   712 are defined in terms of the list cons operation, where
   712 are defined in terms of the list cons operation, where
   779 if we can rewrite from one to the other in finitely
   779 if we can rewrite from one to the other in finitely
   780 many steps.
   780 many steps.
   781 
   781 
   782 For convenience, 
   782 For convenience, 
   783 we define a predicate for a list of regular expressions
   783 we define a predicate for a list of regular expressions
   784 having at least one nullable regular expressions:
   784 having at least one nullable regular expression:
   785 \begin{center}
   785 \begin{center}
   786 	$\textit{bnullables} \; rs \quad \dn \quad \exists r \in rs. \;\; \bnullable \; r$
   786 	$\textit{bnullables} \; rs \quad \dn \quad \exists r \in rs. \;\; \bnullable \; r$
   787 \end{center}
   787 \end{center}
   788 \noindent
   788 \noindent
   789 The rewriting relation $\rightsquigarrow$ preserves (b)nullability:
   789 The rewriting relation $\rightsquigarrow$ preserves (b)nullability:
   801 			\implies \bnullable \; r_1 = \bnullable \; r_2$
   801 			\implies \bnullable \; r_1 = \bnullable \; r_2$
   802 	\end{itemize}
   802 	\end{itemize}
   803 \end{lemma}
   803 \end{lemma}
   804 \begin{proof}
   804 \begin{proof}
   805 	By rule induction of $\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$.
   805 	By rule induction of $\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$.
   806 	The third point is a corollary of the second.
   806 	The third point is a result of the second.
   807 \end{proof}
   807 \end{proof}
   808 \noindent
   808 \noindent
   809 For convenience again,
   809 For convenience again,
   810 we define $\bmkepss$ on a list $rs$,
   810 we define $\bmkepss$ on a list $rs$,
   811 which extracts the bit-codes on the first $\bnullable$ element in $rs$:
   811 which extracts the bit-codes on the first $\bnullable$ element in $rs$:
   860 \begin{lemma}
   860 \begin{lemma}
   861 	$rs_1 @ rs_2 \stackrel{s*}{\rightsquigarrow} 
   861 	$rs_1 @ rs_2 \stackrel{s*}{\rightsquigarrow} 
   862 	(rs_1 @ (\distinctBy \; rs_2 \; \; \rerases \;\; (\map\;\; \rerases \; \; rs_1)))$
   862 	(rs_1 @ (\distinctBy \; rs_2 \; \; \rerases \;\; (\map\;\; \rerases \; \; rs_1)))$
   863 \end{lemma}
   863 \end{lemma}
   864 \noindent
   864 \noindent
   865 It says that that for a list made of two parts $rs_1 @ rs_2$, 
   865 It says that for a list made of two parts $rs_1 @ rs_2$, 
   866 one can throw away the duplicate
   866 one can throw away the duplicate
   867 elements in $rs_2$, as well as those that have appeared in $rs_1$.
   867 elements in $rs_2$, as well as those that have appeared in $rs_1$.
   868 \begin{proof}
   868 \begin{proof}
   869 	By induction on $rs_2$, where $rs_1$ is allowed to be arbitrary.
   869 	By induction on $rs_2$, where $rs_1$ is allowed to be arbitrary.
   870 \end{proof}
   870 \end{proof}
   944 \begin{proof}
   944 \begin{proof}
   945 	By induction on $\rightsquigarrow$ 
   945 	By induction on $\rightsquigarrow$ 
   946 	and $\stackrel{s}{\rightsquigarrow}$, using a number of the previous lemmas.
   946 	and $\stackrel{s}{\rightsquigarrow}$, using a number of the previous lemmas.
   947 \end{proof}
   947 \end{proof}
   948 \noindent
   948 \noindent
   949 Now we can prove property 3, as an immediate corollary:
   949 Now we can prove property 3 as an immediate corollary:
   950 \begin{corollary}\label{rewritesBder}
   950 \begin{corollary}\label{rewritesBder}
   951 	$r_1 \rightsquigarrow^* r_2 \implies r_1 \backslash c \rightsquigarrow^*   
   951 	$r_1 \rightsquigarrow^* r_2 \implies r_1 \backslash c \rightsquigarrow^*   
   952 	r_2 \backslash c$
   952 	r_2 \backslash c$
   953 \end{corollary}
   953 \end{corollary}
   954 \begin{proof}
   954 \begin{proof}
   981 	We know that they generate the same bits, if the lexing result is a match:
   981 	We know that they generate the same bits, if the lexing result is a match:
   982 	\begin{center}
   982 	\begin{center}
   983 		$\bnullable \; (a \backslash s) 
   983 		$\bnullable \; (a \backslash s) 
   984 		\implies \bmkeps \; (a \backslash s) = \bmkeps \; (\bderssimp{a}{s})$
   984 		\implies \bmkeps \; (a \backslash s) = \bmkeps \; (\bderssimp{a}{s})$
   985 	\end{center}
   985 	\end{center}
   986 	Now that they generate the same bits, we know that they give the same value after decoding.
   986 	Now that they generate the same bits, we know they also give the same value after decoding.
   987 	\begin{center}
   987 	\begin{center}
   988 		$\bnullable \; (a \backslash s) 
   988 		$\bnullable \; (a \backslash s) 
   989 		\implies \decode \; r \; (\bmkeps \; (a \backslash s)) = 
   989 		\implies \decode \; r \; (\bmkeps \; (a \backslash s)) = 
   990 		\decode \; r \; (\bmkeps \; (\bderssimp{a}{s}))$
   990 		\decode \; r \; (\bmkeps \; (\bderssimp{a}{s}))$
   991 	\end{center}
   991 	\end{center}
  1009 	$\nexists v. \; (r, s) \rightarrow v \;\; \textit{iff} \;\; \blexersimp\;
  1009 	$\nexists v. \; (r, s) \rightarrow v \;\; \textit{iff} \;\; \blexersimp\;
  1010 	r\;s = \None$.
  1010 	r\;s = \None$.
  1011 \end{corollary}
  1011 \end{corollary}
  1012 
  1012 
  1013 \subsection{Comments on the Proof Techniques Used}
  1013 \subsection{Comments on the Proof Techniques Used}
  1014 Straightforward and simple as the proof may seem,
  1014 Straightforward as the proof may seem,
  1015 the efforts we spent obtaining it were far from trivial.
  1015 the efforts we spent obtaining it were far from trivial.
  1016 We initially attempted to re-use the argument 
  1016 We initially attempted to re-use the argument 
  1017 in \cref{flex_retrieve}. 
  1017 in \cref{flex_retrieve}. 
  1018 The problem is that both functions $\inj$ and $\retrieve$ require 
  1018 The problem is that both functions $\inj$ and $\retrieve$ require 
  1019 that the annotated regular expressions stay unsimplified, 
  1019 that the annotated regular expressions stay unsimplified, 
  1058 	$_{[]}(_{ZZ}\ONE +  _{ZS}c ) $\\
  1058 	$_{[]}(_{ZZ}\ONE +  _{ZS}c ) $\\
  1059 	and\\
  1059 	and\\
  1060 	$_{Z}(_{Z} \ONE + _{S} c)$
  1060 	$_{Z}(_{Z} \ONE + _{S} c)$
  1061 
  1061 
  1062 \end{center}
  1062 \end{center}
  1063 as equal, because they were both re-written
  1063 as equal because they were both re-written
  1064 from the same expression.
  1064 from the same expression.
  1065 
  1065 
  1066 The simplification rewriting rules
  1066 The simplification rewriting rules
  1067 given in \ref{rrewriteRules} are by no means
  1067 given in \ref{rrewriteRules} are by no means
  1068 final,
  1068 final,
  1069 one could come up new rules
  1069 one could come up with new rules
  1070 such as 
  1070 such as 
  1071 $\SEQ r_1 \cdot (\SEQ r_1 \cdot r_3) \rightarrow
  1071 $\SEQ r_1 \cdot (\SEQ r_1 \cdot r_3) \rightarrow
  1072 \SEQs [r_1, r_2, r_3]$.
  1072 \SEQs [r_1, r_2, r_3]$.
  1073 However this does not fit with the proof technique
  1073 However this does not fit with the proof technique
  1074 of our main theorem, but seem to not violate the POSIX
  1074 of our main theorem, but seem to not violate the POSIX