lexing: comparison ChengsongTanPhdThesis/Chapters/Bitcoded2.tex

-:a5f666410101
+:fd068f39ac23
 %its correctness proof in
 %Chapter 3\ref{Chapter3}.
-In this chapter we introduce the simplifications
+In this chapter we introduce simplifications
 on annotated regular expressions that can be applied to
 each intermediate derivative result. This allows
 us to make $\blexer$ much more efficient.
-We contrast this simplification function
+Sulzmann and Lu already had some bit-coded simplifications,
-with Sulzmann and Lu's original
+but their simplification functions were inefficient.
-simplifications, indicating the simplicity of our algorithm and
+We contrast our simplification function
-improvements we made, demostrating
+with Sulzmann and Lu's, indicating the simplicity of our algorithm.
-the usefulness and reliability of formal proofs on algorithms.
+This is another case for the usefulness
+and reliability of formal proofs on algorithms.
 These ``aggressive'' simplifications would not be possible in the injection-based
 lexing we introduced in Chapter \ref{Inj}.
-We then go on to prove the correctness with the improved version of
+We then prove the correctness of the improved version of
 $\blexer$, called $\blexersimp$, by establishing
 $\blexer \; r \; s= \blexersimp \; r \; s$ using a term rewriting system.
 \section{Simplifications by Sulzmann and Lu}
-The first thing we notice in the fast growth of examples such as $(a^*a^*)^*$'s
+Consider the derivatives of examples such as $(a^*a^*)^*$
-and $(a^* + (aa)^*)^*$'s derivatives is that a lot of duplicated sub-patterns
+and $(a^* + (aa)^*)^*$:
-are scattered around different levels, and therefore requires
-de-duplication at different levels:
 \begin{center}
 	$(a^*a^*)^* \stackrel{\backslash a}{\longrightarrow} (a^*a^* + a^*)\cdot(a^*a^*)^* \stackrel{\backslash a}{\longrightarrow} $\\
 	$((a^*a^* + a^*) + a^*)\cdot(a^*a^*)^* + (a^*a^* + a^*)\cdot(a^*a^*)^* \stackrel{\backslash a}{\longrightarrow} \ldots$
 \end{center}
 \noindent
-As we have already mentioned in \ref{eqn:growth2},
+As can be seen, there is a lot of duplication
-a simple-minded simplification function cannot simplify
+in the example we have already mentioned in
+\ref{eqn:growth2}.
+A simple-minded simplification function cannot simplify
 the third regular expression in the above chain of derivative
-regular expressions:
+regular expressions, namely
 \begin{center}
 $((a^*a^* + a^*) + a^*)\cdot(a^*a^*)^* + (a^*a^* + a^*)\cdot(a^*a^*)^*$
 \end{center}
-one would expect a better simplification function to work in the
+because the duplicates are
+not next to each other and therefore the rule
+$r+ r \rightarrow r$ does not fire.
+One would expect a better simplification function to work in the
 following way:
 \begin{gather*}
 	((a^*a^* + \underbrace{a^*}_\text{A})+\underbrace{a^*}_\text{duplicate of A})\cdot(a^*a^*)^* +
 	\underbrace{(a^*a^* + a^*)\cdot(a^*a^*)^*}_\text{further simp removes this}.\\
 	\bigg\downarrow \\
 	\bigg\downarrow \\
 	(a^*a^* + a^*
 	)\cdot(a^*a^*)^*
 \end{gather*}
 \noindent
-This motivating example came from testing Sulzmann and Lu's
+In the first step, the nested alternative regular expression
+$(a^*a^* + a^*) + a^*$ is flattened into $a^*a^* + a^* + a^*$.
+Now the third term $a^*$ is clearly identified as a duplicate
+and therefore removed in the second step. This causes the two
+top-level terms to become the same, and the second $(a^*a^*+a^*)\cdot(a^*a^*)^*$
+is removed in the final step.\\
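 \noindent
 Spelled out, the three steps described above give the following chain.
 This is only a sketch: bit annotations are omitted for readability, and
 each arrow stands for one of the informal steps (flattening, removing the
 inner duplicate, removing the top-level duplicate) rather than for a
 formally defined rule.
 \begin{center}
 	\begin{tabular}{cl}
 		& $((a^*a^* + a^*) + a^*)\cdot(a^*a^*)^* + (a^*a^* + a^*)\cdot(a^*a^*)^*$\\
 		$\longrightarrow$ & $(a^*a^* + a^* + a^*)\cdot(a^*a^*)^* + (a^*a^* + a^*)\cdot(a^*a^*)^*$\\
 		$\longrightarrow$ & $(a^*a^* + a^*)\cdot(a^*a^*)^* + (a^*a^* + a^*)\cdot(a^*a^*)^*$\\
 		$\longrightarrow$ & $(a^*a^* + a^*)\cdot(a^*a^*)^*$\\
 	\end{tabular}
 \end{center}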
+This motivating example is from testing Sulzmann and Lu's
 algorithm: their simplification does
 not work!
-We quote their $\textit{simp}$ function verbatim here:
+Consider their simplification (using our notations):
 \begin{center}
 	\begin{tabular}{lcl}
 		$\simpsulz \; _{bs}(_{bs'}\ONE \cdot r)$ & $\dn$ &
 		$\textit{if} \; (\textit{zeroable} \; r)\; \textit{then} \;\; \ZERO$\\
 						   & &$\textit{else}\;\; \fuse \; (bs@ bs') \; r$\\
 		(\nub \; (\filter \; (\not \circ \zeroable)\;((\simpsulz  \; r) :: \map \; \simpsulz  \; rs)))$\\
 	\end{tabular}
 \end{center}
 \noindent
-the $\textit{zeroable}$ predicate
+where the $\textit{zeroable}$ predicate
-which tests whether the regular expression
+tests whether the regular expression
 is equivalent to $\ZERO$,
-is defined as:
+can be defined as:
 \begin{center}
 	\begin{tabular}{lcl}
 		$\zeroable \; _{bs}\sum (r::rs)$ & $\dn$ & $\zeroable \; r\;\; \land \;\;
 		\zeroable \;_{[]}\sum\;rs $\\
 		$\zeroable\;_{bs}(r_1 \cdot r_2)$ & $\dn$ & $\zeroable\; r_1 \;\; \lor \;\; \zeroable \; r_2$\\
 \end{quote}
 \noindent
 The assumption that the size of the regular expressions
 in the algorithm
 would stay below a finite constant is not true.
+There are two main reasons for this: (i) the $\textit{nub}$
+function requires identical annotations between two
+annotated regular expressions to qualify as duplicates,
+and cannot simplify cases like $_{SZZ}a^*+_{SZS}a^*$
+even though both $a^*$ denote the same language;
+(ii) the ``flattening'' only applies to the head of the list
+in the
+\begin{center}
+	\begin{tabular}{lcl}
+		$\simpsulz  \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2)$ & $\dn$ &
+		$_{bs}\sum ((\map \; (\fuse \; bs')\; rs_1) @ rs_2)$\\
+	\end{tabular}
+\end{center}
+\noindent
+clause, and therefore is not thorough enough to simplify all
+needed parts of the regular expression.\\
 In addition to that, even if the sizes of the regular expressions
 do stay finite, one has to take into account that
 the $\simpsulz$ function is applied many times
 in each derivative step, and that number is not necessarily
 a constant with respect to the size of the regular expression.
 the ideas behind components in their algorithm
 and why they fail to achieve the desired effect, followed
 by our solution. These solutions come with correctness
 statements that are backed up by formal proofs.
 \subsection{Flattening Nested Alternatives}
-The idea behind the
+The idea behind the clause
 \begin{center}
 $\simpsulz  \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2) \quad \dn \quad
 	       _{bs}\sum ((\map \; (\fuse \; bs')\; rs_1) @ rs_2)$
 \end{center}
-clause is that it allows
+is that it allows
 duplicate removal of regular expressions at different
-levels.
+``levels'' of alternatives.
 For example, this would help with the
 following simplification:
 \begin{center}
 $(a+r)+r \longrightarrow a+r$
 \end{center}
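 \noindent
 As an illustration, take $rs_1 = [a, r]$ and $rs_2 = [r]$ (a hypothetical
 instance chosen only for this sketch). The clause then gives
 \begin{center}
 $\simpsulz  \; _{bs}\sum ((_{bs'}\sum [a, r]) :: [r]) \quad \dn \quad
 	       _{bs}\sum [\fuse \; bs' \; a,\;\; \fuse \; bs' \; r,\;\; r]$
 \end{center}
 \noindent
 so both copies of $r$ now sit in the same list, where a later
 duplicate-removal pass can compare them. Assuming that $\fuse \; bs'$
 prepends $bs'$ to the annotation of its argument, the two copies carry
 different annotations whenever $bs'$ is non-empty, so a comparison that
 insists on identical annotations would still treat them as distinct.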
 The problem here is that only the head element
 is ``spilled out'',
 whereas we would want to flatten
-an entire list to open up possibilities for further simplifications.\\
+an entire list to open up possibilities for further simplifications.
 Not flattening the rest of the elements also means that
 the later de-duplication process
-does not fully remove apparent duplicates.
+does not fully remove further duplicates.
 For example,
 using $\simpsulz$ we could not
 simplify
 \begin{center}
-$((a^* a^*)+ (a^* + a^*))\cdot (a^*a^*)^*+
+	$((a^* a^*)+\underline{(a^* + a^*)})\cdot (a^*a^*)^*+
 ((a^*a^*)+a^*)\cdot (a^*a^*)^*$
 \end{center}
 because the underlined part is not the first element
 of the alternative.\\
 We define a flatten operation that flattens not only
 \noindent
 This algorithm keeps the regular expression size small.
-\subsection{$(a+aa)^*$ and $(a^*\cdot a^*)^*$  against
+\subsection{Examples $(a+aa)^*$ and $(a^*\cdot a^*)^*$
-$\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$ After Simplification}
+After Simplification}
-For example,
+Recall the
-with our simplification the
 previous $(a^*a^*)^*$ example
 where $\simpsulz$ could not
-stop the fast growth (over
+prevent the fast growth (over
 3 million nodes at input lengths just below $20$).
-will be reduced to just 15 and stays constant, no matter how long the
+With our simplification, the size is reduced to just 15 and stays constant no matter how long the
 input string is.
-This is demonstrated in the graphs below.
+This is shown in the graphs below.
 \begin{figure}[H]
 \begin{center}
 \begin{tabular}{ll}
 \begin{tikzpicture}
 \begin{axis}[
 %	SECTION rewrite relation
 %----------------------------------------------------------------------------------------
 \section{Correctness of $\blexersimp$}
 In this section we give details
 of the correctness proof of $\blexersimp$,
-an important contribution of this thesis.\\
+one of the contributions of this thesis.\\
 We first introduce the rewriting relation \emph{rrewrite}
 ($\rrewrite$) between two regular expressions,
 which expresses an atomic
-simplification step from the left-hand-side
+simplification.
-to the right-hand-side.
 We then prove properties about
 this rewriting relation and its reflexive transitive closure.
 Finally we leverage these properties to show
 an equivalence between the internal data structures of
 $\blexer$ and $\blexersimp$.
 \subsection{The Rewriting Relation $\rrewrite$($\rightsquigarrow$)}
 In $\blexer$'s correctness proof, we
-did not directly derive the fact that $\blexer$ gives out the POSIX value,
+did not directly derive the fact that $\blexer$ generates the POSIX value,
 but first proved that $\blexer$ is linked with $\lexer$.
 Then we re-use
 the correctness of $\lexer$
 to obtain
 \begin{center}
 by first proving that
 $\blexersimp \; r \; s $
 produces the same output as $\blexer \; r\; s$,
 and then piecing it together with
 $\blexer$'s correctness to achieve our main
-theorem:\footnote{ the case when
+theorem:\footnote{ The case when
-$s$ is not in $L \; r$, is routine to establish }
+$s$ is not in $L \; r$ is routine to establish.}
 \begin{center}
 	$(r, s) \rightarrow v \; \;   \textit{iff} \;\;  \blexersimp \; r \; s = v$
 \end{center}
 \noindent
 The overall idea for the proof
 \begin{center}
 	$r \rightsquigarrow^* \textit{bsimp} \; r$
 \end{center}
 where each rewrite step, written $\rightsquigarrow$,
 is an ``atomic'' simplification that
-cannot be broken down any further:
+is similar to a small-step reduction in operational semantics:
 \begin{figure}[H]
 \begin{mathpar}
 	\inferrule * [Right = $S\ZERO_l$]{\vspace{0em}}{_{bs} \ZERO \cdot r_2 \rightsquigarrow \ZERO\\}
 	\inferrule * [Right = $S\ZERO_r$]{\vspace{0em}}{_{bs} r_1 \cdot \ZERO \rightsquigarrow \ZERO\\}
 	\inferrule{rs_1 \stackrel{s*}{\rightsquigarrow}  rs_2 \land \; rs_2 \stackrel{s*}{\rightsquigarrow} rs_3}{rs_1 \stackrel{s*}{\rightsquigarrow} rs_3}
 \end{mathpar}
 \caption{The Reflexive Transitive Closure of
 $\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$}\label{transClosure}
 \end{figure}
-Two rewritable terms will remain rewritable to each other
+%Two rewritable terms will remain rewritable to each other
-even after a derivative is taken:
+%even after a derivative is taken:
+Rewriting is preserved under derivatives,
+namely
 \begin{center}
 	$r_1 \rightsquigarrow r_2 \implies (r_1 \backslash c) \rightsquigarrow^* (r_2 \backslash c)$
 \end{center}
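 \noindent
 For instance, consider the rule $S\ZERO_l$, which gives
 $_{bs}\ZERO \cdot r_2 \rightsquigarrow \ZERO$. The following sketch assumes
 the usual derivative clauses, namely $\ZERO \backslash c = \ZERO$ and, for a
 non-nullable $r_1$,
 $({}_{bs} r_1 \cdot r_2) \backslash c = {}_{bs}(r_1 \backslash c) \cdot r_2$.
 Under these assumptions, taking the derivative on both sides yields
 \begin{center}
 	$({}_{bs}\ZERO \cdot r_2) \backslash c \;\;=\;\; {}_{bs}\ZERO \cdot r_2
 	\;\;\rightsquigarrow\;\; \ZERO \;\;=\;\; \ZERO \backslash c$
 \end{center}
 \noindent
 so the two derivatives are again related by $\rightsquigarrow^*$
 (here by a single step).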
 And finally, if two terms are rewritable to each other,
 then they produce the same bitcodes:
 	$(r, s) \rightarrow v \;\; \textit{iff} \;\; \blexersimp \; r\; s $
 \end{corollary}
 \subsection{Comments on the Proof Techniques Used}
 Straightforward and simple as the proof may seem,
-the efforts we spent obtaining it was far from trivial.\\
+the efforts we spent obtaining it were far from trivial.\\
 We initially attempted to re-use the argument
 in \cref{flex_retrieve}.
 The problem was that both functions $\inj$ and $\retrieve$ require
 that the annotated regular expressions stay unsimplified,
 so that one can
 	$_{Z}(_{Z} \ONE + _{S} c)$
 \end{center}
 as equal, because they were both re-written
 from the same expression.\\
+The simplification rewriting rules
+given in \ref{rrewriteRules} are by no means
+final:
+one could come up with new rules
+such as
+$\SEQ r_1 \cdot (\SEQ r_2 \cdot r_3) \rightarrow
+\SEQs [r_1, r_2, r_3]$.
+Such a rule does not fit with the proof technique
+of our main theorem, but seems not to violate the POSIX
+property.\\
 Having a correctness property is good.
 But we would also like a guarantee that the lexer is not slow in
 some sense, for example, not grinding to a halt regardless of the input.
 As we have already seen, Sulzmann and Lu's simplification function
 $\simpsulz$ cannot achieve this, because their claim that

changeset 600	fd068f39ac23
parent 591	b2d0de6aee18
child 601	ce4e5151a836