ChengsongTanPhdThesis/Chapters/Bitcoded2.tex
changeset 600 fd068f39ac23
parent 591 b2d0de6aee18
child 601 ce4e5151a836
equal deleted inserted replaced
599:a5f666410101 600:fd068f39ac23
     9 %its correctness proof in 
     9 %its correctness proof in 
    10 %Chapter 3\ref{Chapter3}. 
    10 %Chapter 3\ref{Chapter3}. 
    11 
    11 
    12 
    12 
    13 
    13 
    14 In this chapter we introduce the simplifications
    14 In this chapter we introduce simplifications
    15 on annotated regular expressions that can be applied to 
    15 on annotated regular expressions that can be applied to 
    16 each intermediate derivative result. This allows
    16 each intermediate derivative result. This allows
    17 us to make $\blexer$ much more efficient.
    17 us to make $\blexer$ much more efficient.
    18 We contrast this simplification function 
    18 Sulzmann and Lu already had some bit-coded simplifications,
    19 with Sulzmann and Lu's original
    19 but their simplification functions  were inefficient.
    20 simplifications, indicating the simplicity of our algorithm and
    20 We contrast our simplification function 
    21 improvements we made, demostrating
    21 with Sulzmann and Lu's, indicating the simplicity of our algorithm.
    22 the usefulness and reliability of formal proofs on algorithms.
    22 This is another case for the usefulness 
       
    23 and reliability of formal proofs on algorithms.
    23 These ``aggressive'' simplifications would not be possible in the injection-based 
    24 These ``aggressive'' simplifications would not be possible in the injection-based 
    24 lexing we introduced in chapter \ref{Inj}.
    25 lexing we introduced in chapter \ref{Inj}.
    25 We then go on to prove the correctness with the improved version of 
    26 We then prove the correctness with the improved version of 
    26 $\blexer$, called $\blexersimp$, by establishing 
    27 $\blexer$, called $\blexersimp$, by establishing 
    27 $\blexer \; r \; s= \blexersimp \; r \; s$ using a term rewriting system.
    28 $\blexer \; r \; s= \blexersimp \; r \; s$ using a term rewriting system.
    28 
    29 
    29 \section{Simplifications by Sulzmann and Lu}
    30 \section{Simplifications by Sulzmann and Lu}
    30 The first thing we notice in the fast growth of examples such as $(a^*a^*)^*$'s
    31 Consider the derivatives of examples such as $(a^*a^*)^*$
    31 and $(a^* + (aa)^*)^*$'s derivatives is that a lot of duplicated sub-patterns
    32 and $(a^* + (aa)^*)^*$:
    32 are scattered around different levels, and therefore requires 
       
    33 de-duplication at different levels:
       
    34 \begin{center}
    33 \begin{center}
    35 	$(a^*a^*)^* \stackrel{\backslash a}{\longrightarrow} (a^*a^* + a^*)\cdot(a^*a^*)^* \stackrel{\backslash a}{\longrightarrow} $\\
    34 	$(a^*a^*)^* \stackrel{\backslash a}{\longrightarrow} (a^*a^* + a^*)\cdot(a^*a^*)^* \stackrel{\backslash a}{\longrightarrow} $\\
    36 	$((a^*a^* + a^*) + a^*)\cdot(a^*a^*)^* + (a^*a^* + a^*)\cdot(a^*a^*)^* \stackrel{\backslash a}{\longrightarrow} \ldots$
    35 	$((a^*a^* + a^*) + a^*)\cdot(a^*a^*)^* + (a^*a^* + a^*)\cdot(a^*a^*)^* \stackrel{\backslash a}{\longrightarrow} \ldots$
    37 \end{center}
    36 \end{center}
    38 \noindent
    37 \noindent
    39 As we have already mentioned in \ref{eqn:growth2},
    38 As can be seen, there is a lot of duplication 
    40 a simple-minded simplification function cannot simplify
    39 in the example we have already mentioned in 
       
    40 \ref{eqn:growth2}.
       
    41 A simple-minded simplification function cannot simplify
    41 the third regular expression in the above chain of derivative
    42 the third regular expression in the above chain of derivative
    42 regular expressions:
    43 regular expressions, namely
    43 \begin{center}
    44 \begin{center}
    44 $((a^*a^* + a^*) + a^*)\cdot(a^*a^*)^* + (a^*a^* + a^*)\cdot(a^*a^*)^*$
    45 $((a^*a^* + a^*) + a^*)\cdot(a^*a^*)^* + (a^*a^* + a^*)\cdot(a^*a^*)^*$
    45 \end{center}
    46 \end{center}
    46 one would expect a better simplification function to work in the 
    47 because the duplicates are
       
    48 not next to each other and therefore the rule
       
    49 $r+ r \rightarrow r$ does not fire.
       
    50 One would expect a better simplification function to work in the 
    47 following way:
    51 following way:
    48 \begin{gather*}
    52 \begin{gather*}
    49 	((a^*a^* + \underbrace{a^*}_\text{A})+\underbrace{a^*}_\text{duplicate of A})\cdot(a^*a^*)^* + 
    53 	((a^*a^* + \underbrace{a^*}_\text{A})+\underbrace{a^*}_\text{duplicate of A})\cdot(a^*a^*)^* + 
    50 	\underbrace{(a^*a^* + a^*)\cdot(a^*a^*)^*}_\text{further simp removes this}.\\
    54 	\underbrace{(a^*a^* + a^*)\cdot(a^*a^*)^*}_\text{further simp removes this}.\\
    51 	\bigg\downarrow \\
    55 	\bigg\downarrow \\
    59 	\bigg\downarrow \\
    63 	\bigg\downarrow \\
    60 	(a^*a^* + a^* 
    64 	(a^*a^* + a^* 
    61 	)\cdot(a^*a^*)^*  
    65 	)\cdot(a^*a^*)^*  
    62 \end{gather*}
    66 \end{gather*}
    63 \noindent
    67 \noindent
    64 This motivating example came from testing Sulzmann and Lu's 
    68 In the first step, the nested alternative regular expression
       
    69 $(a^*a^* + a^*) + a^*$ is flattened into $a^*a^* + a^* + a^*$.
       
    70 Now the third term $a^*$ is clearly identified as a duplicate
       
    71 and therefore removed in the second step. This causes the two
       
    72 top-level terms to become the same and the second $(a^*a^*+a^*)\cdot(a^*a^*)^*$ 
       
    73 removed in the final step.\\
       
    74 This motivating example is from testing Sulzmann and Lu's 
    65 algorithm: their simplification does 
    75 algorithm: their simplification does 
    66 not work!
    76 not work!
    67 We quote their $\textit{simp}$ function verbatim here:
    77 Consider their simplification (using our notations):
    68 \begin{center}
    78 \begin{center}
    69 	\begin{tabular}{lcl}
    79 	\begin{tabular}{lcl}
    70 		$\simpsulz \; _{bs}(_{bs'}\ONE \cdot r)$ & $\dn$ & 
    80 		$\simpsulz \; _{bs}(_{bs'}\ONE \cdot r)$ & $\dn$ & 
    71 		$\textit{if} \; (\textit{zeroable} \; r)\; \textit{then} \;\; \ZERO$\\
    81 		$\textit{if} \; (\textit{zeroable} \; r)\; \textit{then} \;\; \ZERO$\\
    72 						   & &$\textit{else}\;\; \fuse \; (bs@ bs') \; r$\\
    82 						   & &$\textit{else}\;\; \fuse \; (bs@ bs') \; r$\\
    83 		(\nub \; (\filter \; (\not \circ \zeroable)\;((\simpsulz  \; r) :: \map \; \simpsulz  \; rs)))$\\ 
    93 		(\nub \; (\filter \; (\not \circ \zeroable)\;((\simpsulz  \; r) :: \map \; \simpsulz  \; rs)))$\\ 
    84 		
    94 		
    85 	\end{tabular}
    95 	\end{tabular}
    86 \end{center}
    96 \end{center}
    87 \noindent
    97 \noindent
    88 the $\textit{zeroable}$ predicate 
    98 where the $\textit{zeroable}$ predicate 
    89 which tests whether the regular expression
    99 tests whether the regular expression
    90 is equivalent to $\ZERO$,
   100 is equivalent to $\ZERO$,
    91 is defined as:
   101 can be defined as:
    92 \begin{center}
   102 \begin{center}
    93 	\begin{tabular}{lcl}
   103 	\begin{tabular}{lcl}
    94 		$\zeroable \; _{bs}\sum (r::rs)$ & $\dn$ & $\zeroable \; r\;\; \land \;\;
   104 		$\zeroable \; _{bs}\sum (r::rs)$ & $\dn$ & $\zeroable \; r\;\; \land \;\;
    95 		\zeroable \;_{[]}\sum\;rs $\\
   105 		\zeroable \;_{[]}\sum\;rs $\\
    96 		$\zeroable\;_{bs}(r_1 \cdot r_2)$ & $\dn$ & $\zeroable\; r_1 \;\; \lor \;\; \zeroable \; r_2$\\
   106 		$\zeroable\;_{bs}(r_1 \cdot r_2)$ & $\dn$ & $\zeroable\; r_1 \;\; \lor \;\; \zeroable \; r_2$\\
   188 \end{quote}
   198 \end{quote}
   189 \noindent
   199 \noindent
   190 The assumption that the size of the regular expressions
   200 The assumption that the size of the regular expressions
   191 in the algorithm
   201 in the algorithm
   192 would stay below a finite constant is not ture.
   202 would stay below a finite constant is not ture.
       
   203 The main reason behind this is that (i) The $\textit{nub}$
       
   204 function requires identical annotations between two 
       
   205 annotated regular expressions to qualify as duplicates,
       
   206 and cannot simplify the cases like $_{SZZ}a^*+_{SZS}a^*$
       
   207 even if both $a^*$ denote the same language.
       
   208 (ii) The ``flattening'' only applies to the head of the list
       
   209 in the 
       
   210 \begin{center}
       
   211 	\begin{tabular}{lcl}
       
   212 
       
   213 		$\simpsulz  \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2)$ & $\dn$ &
       
   214 		$_{bs}\sum ((\map \; (\fuse \; bs')\; rs_1) @ rs_2)$\\
       
   215 	\end{tabular}
       
   216 \end{center}
       
   217 \noindent
       
   218 clause, and therefore is not thorough enough to simplify all
       
   219 needed parts of the regular expression.\\
   193 In addition to that, even if the regular expressions size
   220 In addition to that, even if the regular expressions size
   194 do stay finite, one has to take into account that
   221 do stay finite, one has to take into account that
   195 the $\simpsulz$ function is applied many times
   222 the $\simpsulz$ function is applied many times
   196 in each derivative step, and that number is not necessarily
   223 in each derivative step, and that number is not necessarily
   197 a constant with respect to the size of the regular expression.
   224 a constant with respect to the size of the regular expression.
   207 the ideas behind components in their algorithm 
   234 the ideas behind components in their algorithm 
   208 and why they fail to achieve the desired effect, followed
   235 and why they fail to achieve the desired effect, followed
   209 by our solution. These solutions come with correctness
   236 by our solution. These solutions come with correctness
   210 statements that are backed up by formal proofs.
   237 statements that are backed up by formal proofs.
   211 \subsection{Flattening Nested Alternatives}
   238 \subsection{Flattening Nested Alternatives}
   212 The idea behind the 
   239 The idea behind the clause
   213 \begin{center}
   240 \begin{center}
   214 $\simpsulz  \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2) \quad \dn \quad
   241 $\simpsulz  \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2) \quad \dn \quad
   215 	       _{bs}\sum ((\map \; (\fuse \; bs')\; rs_1) @ rs_2)$
   242 	       _{bs}\sum ((\map \; (\fuse \; bs')\; rs_1) @ rs_2)$
   216 \end{center}
   243 \end{center}
   217 clause is that it allows
   244 is that it allows
   218 duplicate removal of regular expressions at different
   245 duplicate removal of regular expressions at different
   219 levels.
   246 ``levels'' of alternatives.
   220 For example, this would help with the
   247 For example, this would help with the
   221 following simplification:
   248 following simplification:
   222 
   249 
   223 \begin{center}
   250 \begin{center}
   224 $(a+r)+r \longrightarrow a+r$
   251 $(a+r)+r \longrightarrow a+r$
   225 \end{center}
   252 \end{center}
   226 The problem here is that only the head element
   253 The problem here is that only the head element
   227 is ``spilled out'',
   254 is ``spilled out'',
   228 whereas we would want to flatten
   255 whereas we would want to flatten
   229 an entire list to open up possibilities for further simplifications.\\
   256 an entire list to open up possibilities for further simplifications.
   230 Not flattening the rest of the elements also means that
   257 Not flattening the rest of the elements also means that
   231 the later de-duplication processs 
   258 the later de-duplication processs 
   232 does not fully remove apparent duplicates.
   259 does not fully remove further duplicates.
   233 For example,
   260 For example,
   234 using $\simpsulz$ we could not 
   261 using $\simpsulz$ we could not 
   235 simplify
   262 simplify
   236 \begin{center}
   263 \begin{center}
   237 $((a^* a^*)+ (a^* + a^*))\cdot (a^*a^*)^*+
   264 	$((a^* a^*)+\underline{(a^* + a^*)})\cdot (a^*a^*)^*+
   238 ((a^*a^*)+a^*)\cdot (a^*a^*)^*$
   265 ((a^*a^*)+a^*)\cdot (a^*a^*)^*$
   239 \end{center}
   266 \end{center}
   240 due to the underlined part not in the first element
   267 due to the underlined part not in the first element
   241 of the alternative.\\
   268 of the alternative.\\
   242 We define a flatten operation that flattens not only 
   269 We define a flatten operation that flattens not only 
   425 
   452 
   426 \noindent
   453 \noindent
   427 This algorithm keeps the regular expression size small.
   454 This algorithm keeps the regular expression size small.
   428 
   455 
   429 
   456 
   430 \subsection{$(a+aa)^*$ and $(a^*\cdot a^*)^*$  against 
   457 \subsection{Examples $(a+aa)^*$ and $(a^*\cdot a^*)^*$
   431 $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$ After Simplification}
   458 After Simplification}
   432 For example,
   459 Recall the
   433 with our simplification the
       
   434 previous $(a^*a^*)^*$ example
   460 previous $(a^*a^*)^*$ example
   435 where $\simpsulz$ could not
   461 where $\simpsulz$ could not
   436 stop the fast growth (over
   462 prevent the fast growth (over
   437 3 million nodes just below $20$ input length)
   463 3 million nodes just below $20$ input length)
   438 will be reduced to just 15 and stays constant, no matter how long the
   464 will be reduced to just 15 and stays constant no matter how long the
   439 input string is.
   465 input string is.
   440 This is demonstrated in the graphs below.
   466 This is shown in the graphs below.
   441 \begin{figure}[H]
   467 \begin{figure}[H]
   442 \begin{center}
   468 \begin{center}
   443 \begin{tabular}{ll}
   469 \begin{tabular}{ll}
   444 \begin{tikzpicture}
   470 \begin{tikzpicture}
   445 \begin{axis}[
   471 \begin{axis}[
   480 %	SECTION rewrite relation
   506 %	SECTION rewrite relation
   481 %----------------------------------------------------------------------------------------
   507 %----------------------------------------------------------------------------------------
   482 \section{Correctness of $\blexersimp$}
   508 \section{Correctness of $\blexersimp$}
   483 In this section we give details
   509 In this section we give details
   484 of the correctness proof of $\blexersimp$,
   510 of the correctness proof of $\blexersimp$,
   485 an important contribution of this thesis.\\
   511 one of the contributions of this thesis.\\
   486 We first introduce the rewriting relation \emph{rrewrite}
   512 We first introduce the rewriting relation \emph{rrewrite}
   487 ($\rrewrite$) between two regular expressions,
   513 ($\rrewrite$) between two regular expressions,
   488 which expresses an atomic
   514 which expresses an atomic
   489 simplification step from the left-hand-side
   515 simplification.
   490 to the right-hand-side.
       
   491 We then prove properties about
   516 We then prove properties about
   492 this rewriting relation and its reflexive transitive closure.
   517 this rewriting relation and its reflexive transitive closure.
   493 Finally we leverage these properties to show
   518 Finally we leverage these properties to show
   494 an equivalence between the internal data structures of 
   519 an equivalence between the internal data structures of 
   495 $\blexer$ and $\blexersimp$.
   520 $\blexer$ and $\blexersimp$.
   496 
   521 
   497 \subsection{The Rewriting Relation $\rrewrite$($\rightsquigarrow$)}
   522 \subsection{The Rewriting Relation $\rrewrite$($\rightsquigarrow$)}
   498 In the $\blexer$'s correctness proof, we
   523 In the $\blexer$'s correctness proof, we
   499 did not directly derive the fact that $\blexer$ gives out the POSIX value,
   524 did not directly derive the fact that $\blexer$ generates the POSIX value,
   500 but first proved that $\blexer$ is linked with $\lexer$.
   525 but first proved that $\blexer$ is linked with $\lexer$.
   501 Then we re-use
   526 Then we re-use
   502 the correctness of $\lexer$
   527 the correctness of $\lexer$
   503 to obtain
   528 to obtain
   504 \begin{center}
   529 \begin{center}
   509 by first proving that
   534 by first proving that
   510 $\blexersimp \; r \; s $ 
   535 $\blexersimp \; r \; s $ 
   511 produces the same output as $\blexer \; r\; s$,
   536 produces the same output as $\blexer \; r\; s$,
   512 and then piecing it together with 
   537 and then piecing it together with 
   513 $\blexer$'s correctness to achieve our main
   538 $\blexer$'s correctness to achieve our main
   514 theorem:\footnote{ the case when 
   539 theorem:\footnote{ The case when 
   515 $s$ is not in $L \; r$, is routine to establish }
   540 $s$ is not in $L \; r$, is routine to establish.}
   516 \begin{center}
   541 \begin{center}
   517 	$(r, s) \rightarrow v \; \;   \textit{iff} \;\;  \blexersimp \; r \; s = v$
   542 	$(r, s) \rightarrow v \; \;   \textit{iff} \;\;  \blexersimp \; r \; s = v$
   518 \end{center}
   543 \end{center}
   519 \noindent
   544 \noindent
   520 The overall idea for the proof
   545 The overall idea for the proof
   524 \begin{center}
   549 \begin{center}
   525 	$r \rightsquigarrow^* \textit{bsimp} \; r$
   550 	$r \rightsquigarrow^* \textit{bsimp} \; r$
   526 \end{center}
   551 \end{center}
   527 where each rewrite step, written $\rightsquigarrow$,
   552 where each rewrite step, written $\rightsquigarrow$,
   528 is an ``atomic'' simplification that
   553 is an ``atomic'' simplification that
   529 cannot be broken down any further:
   554 is similar to a small-step reduction in operational semantics:
   530 \begin{figure}[H]
   555 \begin{figure}[H]
   531 \begin{mathpar}
   556 \begin{mathpar}
   532 	\inferrule * [Right = $S\ZERO_l$]{\vspace{0em}}{_{bs} \ZERO \cdot r_2 \rightsquigarrow \ZERO\\}
   557 	\inferrule * [Right = $S\ZERO_l$]{\vspace{0em}}{_{bs} \ZERO \cdot r_2 \rightsquigarrow \ZERO\\}
   533 
   558 
   534 	\inferrule * [Right = $S\ZERO_r$]{\vspace{0em}}{_{bs} r_1 \cdot \ZERO \rightsquigarrow \ZERO\\}
   559 	\inferrule * [Right = $S\ZERO_r$]{\vspace{0em}}{_{bs} r_1 \cdot \ZERO \rightsquigarrow \ZERO\\}
   590 	\inferrule{rs_1 \stackrel{s*}{\rightsquigarrow}  rs_2 \land \; rs_2 \stackrel{s*}{\rightsquigarrow} rs_3}{rs_1 \stackrel{s*}{\rightsquigarrow} rs_3}
   615 	\inferrule{rs_1 \stackrel{s*}{\rightsquigarrow}  rs_2 \land \; rs_2 \stackrel{s*}{\rightsquigarrow} rs_3}{rs_1 \stackrel{s*}{\rightsquigarrow} rs_3}
   591 \end{mathpar}
   616 \end{mathpar}
   592 \caption{The Reflexive Transitive Closure of 
   617 \caption{The Reflexive Transitive Closure of 
   593 $\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$}\label{transClosure}
   618 $\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$}\label{transClosure}
   594 \end{figure}
   619 \end{figure}
   595 Two rewritable terms will remain rewritable to each other
   620 %Two rewritable terms will remain rewritable to each other
   596 even after a derivative is taken:
   621 %even after a derivative is taken:
       
   622 Rewriting is preserved under derivatives,
       
   623 namely
   597 \begin{center}
   624 \begin{center}
   598 	$r_1 \rightsquigarrow r_2 \implies (r_1 \backslash c) \rightsquigarrow^* (r_2 \backslash c)$
   625 	$r_1 \rightsquigarrow r_2 \implies (r_1 \backslash c) \rightsquigarrow^* (r_2 \backslash c)$
   599 \end{center}
   626 \end{center}
   600 And finally, if two terms are rewritable to each other,
   627 And finally, if two terms are rewritable to each other,
   601 then they produce the same bitcodes:
   628 then they produce the same bitcodes:
   954 	$(r, s) \rightarrow v \;\; \textit{iff} \;\; \blexersimp \; r\; s $
   981 	$(r, s) \rightarrow v \;\; \textit{iff} \;\; \blexersimp \; r\; s $
   955 \end{corollary}
   982 \end{corollary}
   956 
   983 
   957 \subsection{Comments on the Proof Techniques Used}
   984 \subsection{Comments on the Proof Techniques Used}
   958 Straightforward and simple as the proof may seem,
   985 Straightforward and simple as the proof may seem,
   959 the efforts we spent obtaining it was far from trivial.\\
   986 the efforts we spent obtaining it were far from trivial.\\
   960 We initially attempted to re-use the argument 
   987 We initially attempted to re-use the argument 
   961 in \cref{flex_retrieve}. 
   988 in \cref{flex_retrieve}. 
   962 The problem was that both functions $\inj$ and $\retrieve$ require 
   989 The problem was that both functions $\inj$ and $\retrieve$ require 
   963 that the annotated regular expressions stay unsimplified, 
   990 that the annotated regular expressions stay unsimplified, 
   964 so that one can 
   991 so that one can 
  1005 	$_{Z}(_{Z} \ONE + _{S} c)$
  1032 	$_{Z}(_{Z} \ONE + _{S} c)$
  1006 
  1033 
  1007 \end{center}
  1034 \end{center}
  1008 as equal, because they were both re-written
  1035 as equal, because they were both re-written
  1009 from the same expression.\\
  1036 from the same expression.\\
       
  1037 The simplification rewriting rules
       
  1038 given in \ref{rrewriteRules} are by no means
       
  1039 final,
       
  1040 one could come up new rules
       
  1041 such as 
       
  1042 $\SEQ r_1 \cdot (\SEQ r_1 \cdot r_3) \rightarrow
       
  1043 \SEQs [r_1, r_2, r_3]$.
       
  1044 This does not fit with the proof technique
       
  1045 of our main theorem, but seem to not violate the POSIX
       
  1046 property.\\
  1010 Having correctness property is good. 
  1047 Having correctness property is good. 
  1011 But we would also a guarantee that the lexer is not slow in 
  1048 But we would also a guarantee that the lexer is not slow in 
  1012 some sense, for exampe, not grinding to a halt regardless of the input.
  1049 some sense, for exampe, not grinding to a halt regardless of the input.
  1013 As we have already seen, Sulzmann and Lu's simplification function
  1050 As we have already seen, Sulzmann and Lu's simplification function
  1014 $\simpsulz$ cannot achieve this, because their claim that
  1051 $\simpsulz$ cannot achieve this, because their claim that