lexing: comparison ChengsongTanPhdThesis/Chapters/Bitcoded2.tex

equal deleted inserted replaced

-:dd9dde2d902b
+:80cc6dc4c98b
 In this chapter we introduce simplifications
 for annotated regular expressions that can be applied to
 each intermediate derivative result. This allows
 us to make $\blexer$ much more efficient.
 Sulzmann and Lu already introduced some simplifications for bitcoded regular expressions,
-but their simplification functions  were inefficient and in some cases needs fixing.
+but their simplification functions  were inefficient and in some cases needed fixing.
 %We contrast our simplification function
 %with Sulzmann and Lu's, indicating the simplicity of our algorithm.
 %This is another case for the usefulness
 %and reliability of formal proofs on algorithms.
 %These ``aggressive'' simplifications would not be possible in the injection-based
 \begin{center}
 $((a^*a^* + a^*) + a^*)\cdot(a^*a^*)^* + (a^*a^* + a^*)\cdot(a^*a^*)^*$
 \end{center}
 because the duplicates are
 not next to each other and therefore the rule
-$r+ r \rightarrow r$ does not fire.
+$r+ r \rightarrow r$ from $\textit{simp}$ does not fire.
 One would expect a better simplification function to work in the
 following way:
 \begin{gather*}
 	((a^*a^* + \underbrace{a^*}_\text{A})+\underbrace{a^*}_\text{duplicate of A})\cdot(a^*a^*)^* +
 	\underbrace{(a^*a^* + a^*)\cdot(a^*a^*)^*}_\text{further simp removes this}.\\
 		$\textit{simp}\_{SL}  \; _{bs}\sum []$ & $\dn$ & $\ZERO$\\
 		$\textit{simp}\_{SL}  \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2)$ & $\dn$ &
 		$_{bs}\sum ((\map \; (\fuse \; bs')\; rs_1) @ rs_2)$\\
 		$\textit{simp}\_{SL}  \; _{bs}\sum[r]$ & $\dn$ & $\fuse \; bs \; (\textit{simp}\_{SL}  \; r)$\\
 		$\textit{simp}\_{SL}  \; _{bs}\sum(r::rs)$ & $\dn$ & $_{bs}\sum
-		(\nub \; (\filter \; (\not \circ \zeroable)\;((\textit{simp}\_{SL}  \; r) :: \map \; \textit{simp}\_{SL}  \; rs)))$\\
+		(\nub \; (\filter \; (\neg\zeroable)\;((\textit{simp}\_{SL}  \; r) :: \map \; \textit{simp}\_{SL}  \; rs)))$\\
 	\end{tabular}
 \end{center}
 \noindent
 The $\textit{zeroable}$ predicate
 \end{tabular}
 \end{center}
 \noindent
 We implemented this lexing algorithm in Scala,
 and found that the final derivative regular expression
-size grows exponentially fast (note the logarithmic scale):
+size still grows exponentially (note the logarithmic scale):
 \begin{figure}[H]
 	\centering
 \begin{tikzpicture}
 \begin{axis}[
 xlabel={$n$},
 $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}
 $ using Sulzmann and Lu's lexer}\label{SulzmannLuLexerTime}
 \end{figure}
 \noindent
 which seems like a counterexample for
-their linear complexity claim:
+Sulzmann and Lu's linear complexity claim
+in their paper \cite{Sulzmann2014}:
 \begin{quote}\it
-Linear-Time Complexity Claim \\It is easy to see that each call of one of the functions/operations:
+``Linear-Time Complexity Claim \\It is easy to see that each call of one of the functions/operations:
-simp, fuse, mkEpsBC and isPhi leads to subcalls whose number is bound by the size of the regular expression involved. We claim that thanks to aggressively applying simp this size remains finite. Hence, we can argue that the above mentioned functions/operations have constant time complexity which implies that we can incrementally compute bit-coded parse trees in linear time in the size of the input.
+simp, fuse, mkEpsBC and isPhi leads to subcalls whose number is bound by the size of the regular expression involved. We claim that thanks to aggressively applying simp this size remains finite. Hence, we can argue that the above mentioned functions/operations have constant time complexity which implies that we can incrementally compute bit-coded parse trees in linear time in the size of the input.''
 \end{quote}
 \noindent
 The assumption that the size of the regular expressions
 in the algorithm
-would stay below a finite constant is not true, not in the
+would stay below a finite constant is not true, at least not in the
 examples we considered.
-The main reason behind this is that (i) the $\textit{nub}$
+The main reason behind this is that (i) Haskell's $\textit{nub}$
 function requires identical annotations between two
 annotated regular expressions to qualify as duplicates,
 and therefore cannot simplify cases like $_{SZZ}a^*+_{SZS}a^*$
 even if both $a^*$ denote the same language, and
 (ii) the ``flattening'' only applies to the head of the list
 		$_{bs}\sum ((\map \; (\fuse \; bs')\; rs_1) @ rs_2)$\\
 	\end{tabular}
 \end{center}
 \noindent
 clause, and therefore is not strong enough to simplify all
-needed parts of the regular expression. Moreover, even if the regular expressions size
+needed parts of the regular expression. Moreover,
-do stay finite, one has to take into account that
+the $\textit{simp}\_{SL}$ function is applied repeatedly
-the $\textit{simp}\_{SL}$ function is applied many times
+in each derivative step until a fixed point is reached,
-in each derivative step, and that number is not necessarily
+which makes the algorithm even more
-a constant with respect to the size of the regular expression.
+unpredictable and inefficient.
 %To not get ``caught off guard'' by
 %these counterexamples,
 %one needs to be more careful when designing the
 %simplification function and making claims about them.
 \section{Our $\textit{Simp}$ Function}
-We will now introduce our simplification function.
+We will now introduce our own simplification function.
 %by making a contrast with $\textit{simp}\_{SL}$.
 We also describe
 the ideas behind Sulzmann and Lu's $\textit{simp}\_{SL}$
 algorithm
-and why it fails to achieve the desired effect.
+and why it fails to achieve the desired effect of keeping the sizes of derivatives finitely bounded.
-Our simplification function comes with a formal
+In addition, our simplification function will come with a formal
 correctness proof.
 \subsection{Flattening Nested Alternatives}
 The idea behind the clause
 \begin{center}
 	$\textit{simp}\_{SL}  \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2) \quad \dn \quad
 simplify
 \begin{center}
 	$((a^* a^*)+\underline{(a^* + a^*)})\cdot (a^*a^*)^*+
 ((a^*a^*)+a^*)\cdot (a^*a^*)^*$
 \end{center}
-due to the underlined part not being in the head
+due to the underlined part not being the head
 of the alternative.
 We define our flatten operation so that it flattens
 the entire list:
 \begin{center}
 \end{center}
 \noindent
 The simplification (named $\textit{bsimp}$ for \emph{b}it-coded)
 does a pattern matching on the regular expression.
-When it detected that the regular expression is an alternative or
+When it detects that the regular expression is an alternative or
 sequence, it will try to simplify its children regular expressions
 recursively and then see if one of the children turns into $\ZERO$ or
 $\ONE$, which might trigger further simplification at the current level.
 Current level simplifications are handled by the function $\textit{bsimp}_{ASEQ}$,
 using rules such as  $\ZERO \cdot r \rightarrow \ZERO$ and $\ONE \cdot r \rightarrow r$.
 derivatives with simplifications from our $\simp$ function
 is called $\blexersimp$:
 \begin{center}
 \begin{tabular}{lcl}
 $\textit{blexer\_simp}\;r\,s$ & $\dn$ &
-$\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\
+$\textit{let}\;a = (r^\uparrow)\backslash_{bsimp}\, s\;\textit{in}$\\
 & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
 & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
 & & $\;\;\textit{else}\;\textit{None}$
 \end{tabular}
 \end{center}
+\noindent
-\noindent
+This algorithm keeps the regular expression size small,
-This algorithm keeps the regular expression size small.
+as we shall demonstrate with some examples in the next section.
 \subsection{Examples $(a+aa)^*$ and $(a^*\cdot a^*)^*$
 After Simplification}
 Recall the
 \addplot[red,mark=*, mark options={fill=white}] table {BetterWaterloo.data};
 \end{axis}
 \end{tikzpicture}
 \end{tabular}
 \end{center}
-\caption{Our Improvement over Sulzmann and Lu's in terms of size}
+\caption{Our Improvement over Sulzmann and Lu's in terms of size of the derivatives.}
 \end{figure}
 \noindent
 Given the size difference, it is not
 surprising that our $\blexersimp$ significantly outperforms
-$\textit{blexer\_SLSimp}$.
+$\textit{blexer\_SLSimp}$ by Sulzmann and Lu.
 In the next section we are going to establish that our
 simplification preserves the correctness of the algorithm.
 %----------------------------------------------------------------------------------------
 %	SECTION rewrite relation
 %----------------------------------------------------------------------------------------
 \noindent
 The rules $LT$ and $LH$ are for rewriting two regular expression lists
 such that one regular expression
 in the left-hand-side list is rewritable in one step
 to the right-hand-side's regular expression at the same position.
-This helps with defining the ``context rule'' $AL$.\\
+This helps with defining the ``context rule'' $AL$.
 The reflexive transitive closure of $\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$
 are defined in the usual way:
 \begin{figure}[H]
 	\centering
 \begin{mathpar}
 		\implies \bmkeps \; r = \bmkeps \; r'$}
 Intuitively, this property says we can
 extract the same bitcodes using $\bmkeps$ from the nullable
 components of two regular expressions $r$ and $r'$,
 if we can rewrite from one to the other in finitely
-many steps.\\
+many steps.
 For convenience,
 we define a predicate for a list of regular expressions
 having at least one nullable regular expressions:
 \begin{center}
 	$\textit{bnullables} \; rs \quad \dn \quad \exists r \in rs. \;\; \bnullable \; r$
 \subsubsection{Property 2: $r \stackrel{*}{\rightsquigarrow} \textit{bsimp} \; r$}
 Now we get to the key part of the proof,
 which says that our simplification's helper functions
 such as $\distinctBy$ and $\flts$ describe
 reducts of $\stackrel{s*}{\rightsquigarrow}$ and
-$\rightsquigarrow^* $.\\
+$\rightsquigarrow^* $.
 The first lemma to prove is a more general version of
 $rs_ 1 \rightsquigarrow^* \distinctBy \; rs_1 \; \phi$:
 \begin{lemma}
 	$rs_1 @ rs_2 \stackrel{s*}{\rightsquigarrow}
 	(rs_1 @ (\distinctBy \; rs_2 \; \; \rerases \;\; (\map\;\; \rerases \; \; rs_1)))$
 \begin{proof}
 	We can rewrite in many steps from the original lexer's
 	derivative regular expressions to the
 	lexer with simplification applied (by lemma \ref{bderBderssimp}):
 	\begin{center}
-		$a \backslash s \stackrel{*}{\rightsquigarrow} \bderssimp{a}{s} $.
+		$a \backslash s \rightsquigarrow^* \bderssimp{a}{s} $.
 	\end{center}
 	We know that they generate the same bits, if the lexing result is a match:
 	\begin{center}
 		$\bnullable \; (a \backslash s)
 		\implies \bmkeps \; (a \backslash s) = \bmkeps \; (\bderssimp{a}{s})$
 \begin{center}
 	$(r, s) \rightarrow v \;\; \textit{iff}\;\; \blexer \; r \; s = \Some \;v$\\
 	$\nexists v. \; (r, s) \rightarrow v \;\; \textit{iff} \;\; \blexer\;
 	r\;s = \None$.
 \end{center}
-and obtain the corollary that the bit-coded lexer with simplification is
+and obtain the property that the bit-coded lexer with simplification is
-indeed correctly outputting POSIX lexing result, if such a result exists.
+indeed correctly generating a POSIX lexing result, if such a result exists.
 \begin{corollary}
 	$(r, s) \rightarrow v \;\; \textit{iff} \;\; \blexersimp \; r\; s = \Some \; v$\\
 	$\nexists v. \; (r, s) \rightarrow v \;\; \textit{iff} \;\; \blexersimp\;
 	r\;s = \None$.
 \end{corollary}
 in \cref{flex_retrieve}.
 The problem is that both functions $\inj$ and $\retrieve$ require
 that the annotated regular expressions stay unsimplified,
 so that one can
 correctly compare $v_{i+1}$ and $r_i$  and $v_i$
-in diagram \ref{graph:inj} and
+in diagram \ref{graph:inj}.
-``fit the key into the lock hole''.
-\noindent
 We also tried to prove
 \begin{center}
 $\textit{bsimp} \;\; (\bderssimp{a}{s}) =
 \textit{bsimp} \;\;  (a\backslash s)$,
 \end{center}
 but this turns out to be not true.
-A counterexample would be
+A counterexample is
 \[ a = [(_{Z}1+_{S}c)\cdot [bb \cdot (_{Z}1+_{S}c)]] \;\;
 	\text{and} \;\; s = bb.
 \]
 \noindent
 Then we would have
 Unfortunately,
 if we apply $\textit{bsimp}$ differently
 we will always have this discrepancy.
 This is due to
 the $\map \; (\fuse\; bs) \; as$ operation
-happening at different locations in the regular expression.\\
+happening at different locations in the regular expression.
 The rewriting relation
 $\rightsquigarrow^*$
 allows us to ignore this discrepancy
 and view the expressions
 \begin{center}
 	and\\
 	$_{Z}(_{Z} \ONE + _{S} c)$
 \end{center}
 as equal, because they were both re-written
-from the same expression.\\
+from the same expression.
 The simplification rewriting rules
 given in \ref{rrewriteRules} are by no means
 final,
 one could come up new rules
 such as

changeset 639	80cc6dc4c98b
parent 624	8ffa28fce271
child 640	bd1354127574