lexing: comparison ChengsongTanPhdThesis/Chapters/Bitcoded2.tex

equal deleted inserted replaced

-:c0c1ebe09c7d
+:8ffa28fce271
 %Chapter 3\ref{Chapter3}.
 In this chapter we introduce simplifications
-on annotated regular expressions that can be applied to
+for annotated regular expressions that can be applied to
 each intermediate derivative result. This allows
 us to make $\blexer$ much more efficient.
-Sulzmann and Lu already had some bit-coded simplifications,
+Sulzmann and Lu already introduced some simplifications for bitcoded regular expressions,
-but their simplification functions  were inefficient.
+but their simplification functions  were inefficient and in some cases needs fixing.
-We contrast our simplification function
+%We contrast our simplification function
-with Sulzmann and Lu's, indicating the simplicity of our algorithm.
+%with Sulzmann and Lu's, indicating the simplicity of our algorithm.
-This is another case for the usefulness
+%This is another case for the usefulness
-and reliability of formal proofs on algorithms.
+%and reliability of formal proofs on algorithms.
-These ``aggressive'' simplifications would not be possible in the injection-based
+%These ``aggressive'' simplifications would not be possible in the injection-based
-lexing we introduced in chapter \ref{Inj}.
+%lexing we introduced in chapter \ref{Inj}.
-We then prove the correctness with the improved version of
+%We then prove the correctness with the improved version of
-$\blexer$, called $\blexersimp$, by establishing
+%$\blexer$, called $\blexersimp$, by establishing
-$\blexer \; r \; s= \blexersimp \; r \; s$ using a term rewriting system.
+%$\blexer \; r \; s= \blexersimp \; r \; s$ using a term rewriting system.
+%
 \section{Simplifications by Sulzmann and Lu}
-Consider the derivatives of examples such as $(a^*a^*)^*$
+Consider the derivatives of the following example $(a^*a^*)^*$:
-and $(a^* + (aa)^*)^*$:
+%and $(a^* + (aa)^*)^*$:
 \begin{center}
-	$(a^*a^*)^* \stackrel{\backslash a}{\longrightarrow} (a^*a^* + a^*)\cdot(a^*a^*)^* \stackrel{\backslash a}{\longrightarrow} $\\
+	\begin{tabular}{lcl}
-	$((a^*a^* + a^*) + a^*)\cdot(a^*a^*)^* + (a^*a^* + a^*)\cdot(a^*a^*)^* \stackrel{\backslash a}{\longrightarrow} \ldots$
+		$(a^*a^*)^*$ & $ \stackrel{\backslash a}{\longrightarrow}$ &
-\end{center}
+		$ (a^*a^* + a^*)\cdot(a^*a^*)^*$\\
-\noindent
+			     &
-As can be seen, there is a lot of duplication
+		$ \stackrel{\backslash a}{\longrightarrow} $ &
-in the example we have already mentioned in
+	$((a^*a^* + a^*) + a^*)\cdot(a^*a^*)^* + (a^*a^* + a^*)\cdot(a^*a^*)^*$\\
-\ref{eqn:growth2}.
+							     & $\stackrel{\backslash a}{
+	\longrightarrow} $ & $\ldots$\\
+	\end{tabular}
+\end{center}
+\noindent
+As can be seen, there are serveral duplications.
 A simple-minded simplification function cannot simplify
 the third regular expression in the above chain of derivative
 regular expressions, namely
 \begin{center}
 $((a^*a^* + a^*) + a^*)\cdot(a^*a^*)^* + (a^*a^* + a^*)\cdot(a^*a^*)^*$
 One would expect a better simplification function to work in the
 following way:
 \begin{gather*}
 	((a^*a^* + \underbrace{a^*}_\text{A})+\underbrace{a^*}_\text{duplicate of A})\cdot(a^*a^*)^* +
 	\underbrace{(a^*a^* + a^*)\cdot(a^*a^*)^*}_\text{further simp removes this}.\\
-	\bigg\downarrow \\
+	\bigg\downarrow (1) \\
 	(a^*a^* + a^*
 	\color{gray} + a^* \color{black})\cdot(a^*a^*)^* +
 	\underbrace{(a^*a^* + a^*)\cdot(a^*a^*)^*}_\text{further simp removes this} \\
-	\bigg\downarrow \\
+	\bigg\downarrow (2) \\
 	(a^*a^* + a^*
 	)\cdot(a^*a^*)^*
 	\color{gray} + (a^*a^* + a^*) \cdot(a^*a^*)^*\\
-	\bigg\downarrow \\
+	\bigg\downarrow (3) \\
 	(a^*a^* + a^*
 	)\cdot(a^*a^*)^*
 \end{gather*}
 \noindent
 In the first step, the nested alternative regular expression
 $(a^*a^* + a^*) + a^*$ is flattened into $a^*a^* + a^* + a^*$.
-Now the third term $a^*$ is clearly identified as a duplicate
+Now the third term $a^*$ can clearly be identified as a duplicate
-and therefore removed in the second step. This causes the two
+and therefore removed in the second step.
+This causes the two
 top-level terms to become the same and the second $(a^*a^*+a^*)\cdot(a^*a^*)^*$
-removed in the final step.\\
+removed in the final step.
-This motivating example is from testing Sulzmann and Lu's
+Sulzmann and Lu's simplification function (using our notations) can achieve this
-algorithm: their simplification does
+simplification:
-not work!
-Consider their simplification (using our notations):
 \begin{center}
 	\begin{tabular}{lcl}
-		$\simpsulz \; _{bs}(_{bs'}\ONE \cdot r)$ & $\dn$ &
+		$\textit{simp}\_{SL} \; _{bs}(_{bs'}\ONE \cdot r)$ & $\dn$ &
 		$\textit{if} \; (\textit{zeroable} \; r)\; \textit{then} \;\; \ZERO$\\
 						   & &$\textit{else}\;\; \fuse \; (bs@ bs') \; r$\\
-		$\simpsulz \;(_{bs}r_1\cdot r_2)$ & $\dn$ & $\textit{if}
+		$\textit{simp}\_{SL} \;(_{bs}r_1\cdot r_2)$ & $\dn$ & $\textit{if}
 		\; (\textit{zeroable} \; r_1 \; \textit{or} \; \textit{zeroable}\; r_2)\;
 		\textit{then} \;\; \ZERO$\\
-					     & & $\textit{else}\;\;_{bs}((\simpsulz \;r_1)\cdot
+							    & & $\textit{else}\;\;_{bs}((\textit{simp}\_{SL} \;r_1)\cdot
-					     (\simpsulz \; r_2))$\\
+							    (\textit{simp}\_{SL} \; r_2))$\\
-		$\simpsulz  \; _{bs}\sum []$ & $\dn$ & $\ZERO$\\
+		$\textit{simp}\_{SL}  \; _{bs}\sum []$ & $\dn$ & $\ZERO$\\
-		$\simpsulz  \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2)$ & $\dn$ &
+		$\textit{simp}\_{SL}  \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2)$ & $\dn$ &
 		$_{bs}\sum ((\map \; (\fuse \; bs')\; rs_1) @ rs_2)$\\
-		$\simpsulz  \; _{bs}\sum[r]$ & $\dn$ & $\fuse \; bs \; (\simpsulz  \; r)$\\
+		$\textit{simp}\_{SL}  \; _{bs}\sum[r]$ & $\dn$ & $\fuse \; bs \; (\textit{simp}\_{SL}  \; r)$\\
-		$\simpsulz  \; _{bs}\sum(r::rs)$ & $\dn$ & $_{bs}\sum
+		$\textit{simp}\_{SL}  \; _{bs}\sum(r::rs)$ & $\dn$ & $_{bs}\sum
-		(\nub \; (\filter \; (\not \circ \zeroable)\;((\simpsulz  \; r) :: \map \; \simpsulz  \; rs)))$\\
+		(\nub \; (\filter \; (\not \circ \zeroable)\;((\textit{simp}\_{SL}  \; r) :: \map \; \textit{simp}\_{SL}  \; rs)))$\\
 	\end{tabular}
 \end{center}
 \noindent
-where the $\textit{zeroable}$ predicate
+The $\textit{zeroable}$ predicate
 tests whether the regular expression
-is equivalent to $\ZERO$,
+is equivalent to $\ZERO$, and
 can be defined as:
 \begin{center}
 	\begin{tabular}{lcl}
 		$\zeroable \; _{bs}\sum (r::rs)$ & $\dn$ & $\zeroable \; r\;\; \land \;\;
 		\zeroable \;_{[]}\sum\;rs $\\
 		$\zeroable\;_{bs}\ONE$ & $\dn$ & $\textit{false}$\\
 		$\zeroable\;_{bs}\ZERO$ & $\dn$ & $\textit{true}$
 	\end{tabular}
 \end{center}
 \noindent
-They suggested that the $\simpsulz $ function should be
+The
+\begin{center}
+	\begin{tabular}{lcl}
+		$\textit{simp}\_{SL}  \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2)$ & $\dn$ &
+		$_{bs}\sum ((\map \; (\fuse \; bs')\; rs_1) @ rs_2)$\\
+	\end{tabular}
+\end{center}
+\noindent
+clause does flatten the alternative as required in step (1),
+but $\textit{simp}\_{SL}$ is insufficient if we want to do steps (2) and (3),
+as these ``identical'' terms have different bit-annotations.
+They also suggested that the $\textit{simp}\_{SL} $ function should be
 applied repeatedly until a fixpoint is reached.
-We call this construction $\textit{sulzSimp}$:
+We call this construction $\textit{SLSimp}$:
 \begin{center}
 	\begin{tabular}{lcl}
-		$\textit{sulzSimp} \; r$ & $\dn$ &
+		$\textit{SLSimp} \; r$ & $\dn$ &
-		$\textit{while}((\simpsulz  \; r)\; \cancel{=} \; r)$ \\
+		$\textit{while}((\textit{simp}\_{SL}  \; r)\; \cancel{=} \; r)$ \\
-		& & $\quad r := \simpsulz  \; r$\\
+					 & & $\quad r := \textit{simp}\_{SL}  \; r$\\
 		& & $\textit{return} \; r$
 	\end{tabular}
 \end{center}
 We call the operation of alternatingly
 applying derivatives and simplifications
 (until the string is exhausted) Sulz-simp-derivative,
-written $\backslash_{sulzSimp}$:
+written $\backslash_{SLSimp}$:
 \begin{center}
 \begin{tabular}{lcl}
-	$r \backslash_{sulzSimp} (c\!::\!s) $ & $\dn$ & $(\textit{sulzSimp} \; (r \backslash c)) \backslash_{sulzSimp}\, s$ \\
+	$r \backslash_{SLSimp} (c\!::\!s) $ & $\dn$ & $(\textit{SLSimp} \; (r \backslash c)) \backslash_{SLSimp}\, s$ \\
-$r \backslash_{sulzSimp} [\,] $ & $\dn$ & $r$
+$r \backslash_{SLSimp} [\,] $ & $\dn$ & $r$
 \end{tabular}
 \end{center}
 \noindent
 After the derivatives have been taken, the bitcodes
 are extracted and decoded in the same manner
 as $\blexer$:
 \begin{center}
 \begin{tabular}{lcl}
-$\textit{blexer\_sulzSimp}\;r\,s$ & $\dn$ &
+$\textit{blexer\_SLSimp}\;r\,s$ & $\dn$ &
-$\textit{let}\;a = (r^\uparrow)\backslash_{sulzSimp}\, s\;\textit{in}$\\
+$\textit{let}\;a = (r^\uparrow)\backslash_{SLSimp}\, s\;\textit{in}$\\
 & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
 & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
 & & $\;\;\textit{else}\;\textit{None}$
 \end{tabular}
 \end{center}
 \noindent
 We implemented this lexing algorithm in Scala,
 and found that the final derivative regular expression
-size grows exponentially fast:
+size grows exponentially fast (note the logarithmic scale):
 \begin{figure}[H]
 	\centering
 \begin{tikzpicture}
 \begin{axis}[
 xlabel={$n$},
 $ using Sulzmann and Lu's lexer}\label{SulzmannLuLexer}
 \end{figure}
 \noindent
 At $n= 20$ we already get an out of memory error with Scala's normal
 JVM heap size settings.
-In fact their simplification does not improve over
+In fact their simplification does not improve much over
 the simple-minded simplifications we have shown in \ref{fig:BetterWaterloo}.
 The time required also grows exponentially:
 \begin{figure}[H]
 	\centering
 \begin{tikzpicture}
 simp, fuse, mkEpsBC and isPhi leads to subcalls whose number is bound by the size of the regular expression involved. We claim that thanks to aggressively applying simp this size remains finite. Hence, we can argue that the above mentioned functions/operations have constant time complexity which implies that we can incrementally compute bit-coded parse trees in linear time in the size of the input.
 \end{quote}
 \noindent
 The assumption that the size of the regular expressions
 in the algorithm
-would stay below a finite constant is not ture.
+would stay below a finite constant is not true, not in the
-The main reason behind this is that (i) The $\textit{nub}$
+examples we considered.
+The main reason behind this is that (i) the $\textit{nub}$
 function requires identical annotations between two
 annotated regular expressions to qualify as duplicates,
-and cannot simplify the cases like $_{SZZ}a^*+_{SZS}a^*$
+and therefore cannot simplify cases like $_{SZZ}a^*+_{SZS}a^*$
-even if both $a^*$ denote the same language.
+even if both $a^*$ denote the same language, and
-(ii) The ``flattening'' only applies to the head of the list
+(ii) the ``flattening'' only applies to the head of the list
 in the
 \begin{center}
 	\begin{tabular}{lcl}
-		$\simpsulz  \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2)$ & $\dn$ &
+		$\textit{simp}\_{SL}  \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2)$ & $\dn$ &
 		$_{bs}\sum ((\map \; (\fuse \; bs')\; rs_1) @ rs_2)$\\
 	\end{tabular}
 \end{center}
 \noindent
-clause, and therefore is not thorough enough to simplify all
+clause, and therefore is not strong enough to simplify all
-needed parts of the regular expression.\\
+needed parts of the regular expression. Moreover, even if the regular expressions size
-In addition to that, even if the regular expressions size
 do stay finite, one has to take into account that
-the $\simpsulz$ function is applied many times
+the $\textit{simp}\_{SL}$ function is applied many times
 in each derivative step, and that number is not necessarily
 a constant with respect to the size of the regular expression.
-To not get ``caught off guard'' by
+%To not get ``caught off guard'' by
-these counterexamples,
+%these counterexamples,
-one needs to be more careful when designing the
+%one needs to be more careful when designing the
-simplification function and making claims about them.
+%simplification function and making claims about them.
 \section{Our $\textit{Simp}$ Function}
-We will now introduce our simplification function,
+We will now introduce our simplification function.
-by making a contrast with $\simpsulz$.
+%by making a contrast with $\textit{simp}\_{SL}$.
-We describe
+We also describe
-the ideas behind components in their algorithm
+the ideas behind Sulzmann and Lu's $\textit{simp}\_{SL}$
-and why they fail to achieve the desired effect, followed
+algorithm
-by our solution. These solutions come with correctness
+and why it fails to achieve the desired effect.
-statements that are backed up by formal proofs.
+Our simplification function comes with a formal
+correctness proof.
 \subsection{Flattening Nested Alternatives}
 The idea behind the clause
 \begin{center}
-$\simpsulz  \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2) \quad \dn \quad
+	$\textit{simp}\_{SL}  \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2) \quad \dn \quad
 	       _{bs}\sum ((\map \; (\fuse \; bs')\; rs_1) @ rs_2)$
 \end{center}
 is that it allows
 duplicate removal of regular expressions at different
 ``levels'' of alternatives.
 following simplification:
 \begin{center}
 $(a+r)+r \longrightarrow a+r$
 \end{center}
-The problem here is that only the head element
+The problem is that only the head element
-is ``spilled out'',
+is ``spilled out''.
-whereas we would want to flatten
+It is more desirable
-an entire list to open up possibilities for further simplifications.
+to flatten
+an entire list to open up possibilities for further simplifications
+with later regular expressions.
 Not flattening the rest of the elements also means that
 the later de-duplication processs
 does not fully remove further duplicates.
 For example,
-using $\simpsulz$ we could not
+using $\textit{simp}\_{SL}$ we cannot
 simplify
 \begin{center}
 	$((a^* a^*)+\underline{(a^* + a^*)})\cdot (a^*a^*)^*+
 ((a^*a^*)+a^*)\cdot (a^*a^*)^*$
 \end{center}
-due to the underlined part not in the first element
+due to the underlined part not being in the head
-of the alternative.\\
+of the alternative.
-We define a flatten operation that flattens not only
-the first regular expression of an alternative,
+We define our flatten operation so that it flattens
-but the entire list:
+the entire list:
 \begin{center}
 \begin{tabular}{@{}lcl@{}}
 $\textit{flts} \; (_{bs}\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $(\textit{map} \;
 (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flts} \; as' $ \\
 $\textit{flts} \; \ZERO :: as'$ & $\dn$ & $ \textit{flts} \;  \textit{as'} $ \\
 \noindent
 Our $\flts$ operation
 also throws away $\ZERO$s
 as they do not contribute to a lexing result.
 \subsection{Duplicate Removal}
-After flattening is done, we are ready to deduplicate.
+After flattening is done, we can deduplicate.
 The de-duplicate function is called $\distinctBy$,
 and that is where we make our second improvement over
-Sulzmann and Lu's.
+Sulzmann and Lu's simplification method.
 The process goes as follows:
 \begin{center}
 $rs \stackrel{\textit{flts}}{\longrightarrow}
 rs_{flat}
 \xrightarrow{\distinctBy \;
 	\end{tabular}
 \end{center}
 \noindent
 The reason we define a distinct function under a mapping $f$ is because
 we want to eliminate regular expressions that are syntactically the same,
-but with different bit-codes.
+but have different bit-codes.
 For example, we can remove the second $a^*a^*$ from
 $_{ZSZ}a^*a^* + _{SZZ}a^*a^*$, because it
 represents a match with shorter initial sub-match
 (and therefore is definitely not POSIX),
 and will be discarded by $\bmkeps$ later.
 regular expressions we have already seen
 will definitely not contribute to a POSIX value,
 even if they are attached with different bitcodes.
 These duplicates therefore need to be removed.
 To achieve this, we call $\rerases$ as the function $f$ during the distinction
-operation.\\
+operation. The function
 $\rerases$ is very similar to $\erase$, except that it preserves the structure
 when erasing an alternative regular expression.
 The reason why we use $\rerases$ instead of $\erase$ is that
 it keeps the structures of alternative
 annotated regular expressions
-whereas $\erase$ would turn it back into a binary structure.
+whereas $\erase$ would turn it back into a binary  tree structure.
 Not having to mess with the structure
 greatly simplifies the finiteness proof in chapter
-\ref{Finite} (we will follow up with more details there).
+\ref{Finite}.
 We give the definitions of $\rerases$ here together with
 the new datatype used by $\rerases$ (as our plain
 regular expression datatype does not allow non-binary alternatives).
-For the moment the reader can just think of
+For now we can think of
-$\rerases$ as $\erase$ and $\rrexp$ as plain regular expressions.
+$\rerases$ as the function $(\_)_\downarrow$ defined in chapter \ref{Bitcoded1}
+and $\rrexp$ as plain regular expressions, but having a general list constructor
+for alternatives:
 \begin{figure}[H]
 \begin{center}
 	$\rrexp ::=   \RZERO \mid  \RONE
 			 \mid  \RCHAR{c}
 			 \mid  \RSEQ{r_1}{r_2}
 			 \mid \RSTAR{r}        $
 \end{center}
 \caption{$\rrexp$: plain regular expressions, but with $\sum$ alternative
 constructor}\label{rrexpDef}
 \end{figure}
-The notation of $\rerases$ also follows that of $\erase$,
+The function $\rerases$ we define as follows:
-which is a postfix operator written as a subscript,
-except that it has an \emph{r} attached to it to distinguish against $\erase$:
 \begin{center}
 \begin{tabular}{lcl}
 $\rerase{\ZERO}$ & $\dn$ & $\RZERO$\\
 $\rerase{_{bs}\ONE}$ & $\dn$ & $\RONE$\\
 	$\rerase{_{bs}\mathbf{c}}$ & $\dn$ & $\RCHAR{c}$\\
 $\rerase{_{bs} a ^*}$ & $\dn$ & $\rerase{a}^*$
 \end{tabular}
 \end{center}
 \subsection{Putting Things Together}
-A recursive definition of our  simplification function
+We can now give the definition of our  simplification function:
-is given below:
 %that looks somewhat similar to our Scala code is
 \begin{center}
 \begin{tabular}{@{}lcl@{}}
 	  $\textit{bsimp} \; (_{bs}a_1\cdot a_2)$ & $\dn$ & $ \textit{bsimp}_{ASEQ} \; bs \;(\textit{bsimp} \; a_1) \; (\textit{bsimp}  \; a_2)  $ \\
 The most involved part is the $\sum$ clause, where we first call $\flts$ on
 the simplified children regular expression list $\textit{map}\; \textit{bsimp}\; \textit{as}$.
 and then call $\distinctBy$ on that list, the predicate determining whether two
 elements are the same is $\rerases \; r_1 = \rerases\; r_2$.
 Finally, depending on whether the regular expression list $as'$ has turned into a
-singleton or empty list after $\flts$ and $\distinctBy$, $\textit{bsimp}_{AALTS}$
+singleton or empty list after $\flts$ and $\distinctBy$, $\textit{bsimp}_{ALTS}$
 decides whether to keep the current level constructor $\sum$ as it is, and
 removes it when there are less than two elements:
 \begin{center}
 	\begin{tabular}{lcl}
-		$\textit{bsimp}_{AALTS} \; bs \; as'$ & $ \dn$ & $ as' \; \textit{match}$\\
+		$\textit{bsimp}_{ALTS} \; bs \; as'$ & $ \dn$ & $ as' \; \textit{match}$\\
 &&$\quad\textit{case} \; [] \Rightarrow  \ZERO$ \\
 &&$\quad\textit{case} \; a :: [] \Rightarrow  \textit{fuse bs a}$ \\
 &&$\quad\textit{case} \;  as' \Rightarrow _{bs}\sum \textit{as'}$\\
 	\end{tabular}
 \end{center}
-Having defined the $\bsimp$ function,
+Having defined the $\textit{bsimp}$ function,
-we add it as a phase after a derivative is taken,
+we add it as a phase after a derivative is taken.
-so it stays small:
 \begin{center}
 	\begin{tabular}{lcl}
 		$r \backslash_{bsimp} s$ & $\dn$ & $\textit{bsimp}(r \backslash s)$
 	\end{tabular}
 \end{center}
 %Following previous notations
 %when extending from derivatives w.r.t.~character to derivative
 %w.r.t.~string, we define the derivative that nests simplifications
 %with derivatives:%\comment{simp in  the [] case?}
-We extend this from character to string:
+We extend this from characters to strings:
 \begin{center}
 \begin{tabular}{lcl}
 $r \backslash_{bsimps} (c\!::\!s) $ & $\dn$ & $(r \backslash_{bsimp}\, c) \backslash_{bsimps}\, s$ \\
 $r \backslash_{bsimps} [\,] $ & $\dn$ & $r$
 \end{tabular}
 \subsection{Examples $(a+aa)^*$ and $(a^*\cdot a^*)^*$
 After Simplification}
 Recall the
 previous $(a^*a^*)^*$ example
-where $\simpsulz$ could not
+where $\textit{simp}\_{SL}$ could not
 prevent the fast growth (over
 3 million nodes just below $20$ input length)
 will be reduced to just 15 and stays constant no matter how long the
 input string is.
 This is shown in the graphs below.
 \begin{axis}[
 xlabel={$n$},
 ylabel={derivative size},
 width = 7cm,
 height = 4cm,
-legend entries={Lexer with $\simpsulz$},
+legend entries={Lexer with $\textit{simp}\_{SL}$},
 legend pos=  north west,
 legend cell align=left]
 \addplot[red,mark=*, mark options={fill=white}] table {BetterWaterloo.data};
 \end{axis}
 \end{tikzpicture}
 \caption{Our Improvement over Sulzmann and Lu's in terms of size}
 \end{figure}
 \noindent
 Given the size difference, it is not
 surprising that our $\blexersimp$ significantly outperforms
-$\textit{blexer\_sulzSimp}$.
+$\textit{blexer\_SLSimp}$.
-In the next section we are going to establish the
+In the next section we are going to establish that our
-first important property of our lexer--the correctness.
+simplification preserves the correctness of the algorithm.
 %----------------------------------------------------------------------------------------
 %	SECTION rewrite relation
 %----------------------------------------------------------------------------------------
 \section{Correctness of $\blexersimp$}
-In this section we give details
-of the correctness proof of $\blexersimp$,
-one of the contributions of this thesis.\\
 We first introduce the rewriting relation \emph{rrewrite}
 ($\rrewrite$) between two regular expressions,
-which expresses an atomic
+which stands for an atomic
 simplification.
 We then prove properties about
 this rewriting relation and its reflexive transitive closure.
 Finally we leverage these properties to show
-an equivalence between the internal data structures of
+an equivalence between the results generated by
 $\blexer$ and $\blexersimp$.
 \subsection{The Rewriting Relation $\rrewrite$($\rightsquigarrow$)}
 In the $\blexer$'s correctness proof, we
 did not directly derive the fact that $\blexer$ generates the POSIX value,
-but first proved that $\blexer$ is linked with $\lexer$.
+but first proved that $\blexer$ generates the same result as $\lexer$.
 Then we re-use
 the correctness of $\lexer$
 to obtain
 \begin{center}
-	$(r, s) \rightarrow v \;\; \textit{iff} \;\; \blexer \; r \;s = v$.
+	$(r, s) \rightarrow v \;\; \textit{iff} \;\; \blexer \; r \;s = v$\\
-\end{center}
+	$\nexists v. \; (r, s) \rightarrow v \;\; \textit{iff} \;\; \blexer\;
+	r\;s = \None$.
+\end{center}
+%\begin{center}
+%	$(r, s) \rightarrow v \;\; \textit{iff} \;\; \blexer \; r \;s = v$.
+%\end{center}
 Here we apply this
 modularised technique again
 by first proving that
 $\blexersimp \; r \; s $
 produces the same output as $\blexer \; r\; s$,
 and then piecing it together with
 $\blexer$'s correctness to achieve our main
-theorem:\footnote{ The case when
+theorem:
-$s$ is not in $L \; r$, is routine to establish.}
+\begin{center}
-\begin{center}
+	$(r, s) \rightarrow v \; \;   \textit{iff} \;\;  \blexersimp \; r \; s = \Some \;v$
-	$(r, s) \rightarrow v \; \;   \textit{iff} \;\;  \blexersimp \; r \; s = v$
+	\\
+	$\nexists v. \; (r, s) \rightarrow v \;\; \textit{iff} \;\; \blexersimp\;
+	r\;s = \None$
 \end{center}
 \noindent
 The overall idea for the proof
 of $\blexer \;r \;s = \blexersimp \; r \;s$
 is that the transition from $r$ to $\textit{bsimp}\; r$ can be
-broken down into finitely many rewrite steps:
+broken down into smaller rewrite steps of the form:
 \begin{center}
 	$r \rightsquigarrow^* \textit{bsimp} \; r$
 \end{center}
 where each rewrite step, written $\rightsquigarrow$,
 is an ``atomic'' simplification that
-is similar to a small-step reduction in operational semantics:
+is similar to a small-step reduction in operational semantics (
+see figure \ref{rrewriteRules} for the rules):
 \begin{figure}[H]
 \begin{mathpar}
 	\inferrule * [Right = $S\ZERO_l$]{\vspace{0em}}{_{bs} \ZERO \cdot r_2 \rightsquigarrow \ZERO\\}
 	\inferrule * [Right = $S\ZERO_r$]{\vspace{0em}}{_{bs} r_1 \cdot \ZERO \rightsquigarrow \ZERO\\}
 to be removed provided a regular expression
 earlier in the list can match the same strings.
 }\label{rrewriteRules}
 \end{figure}
 \noindent
-The rules such as $LT$ and $LH$ are for rewriting between two regular expression lists
+The rules $LT$ and $LH$ are for rewriting two regular expression lists
 such that one regular expression
 in the left-hand-side list is rewritable in one step
 to the right-hand-side's regular expression at the same position.
-This helps with defining the ``context rules'' such as $AL$.\\
+This helps with defining the ``context rule'' $AL$.\\
 The reflexive transitive closure of $\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$
 are defined in the usual way:
 \begin{figure}[H]
 	\centering
 \begin{mathpar}
 \caption{The Reflexive Transitive Closure of
 $\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$}\label{transClosure}
 \end{figure}
 %Two rewritable terms will remain rewritable to each other
 %even after a derivative is taken:
-Rewriting is preserved under derivatives,
+The main point of our rewriting relation
+is that it is preserved under derivatives,
 namely
 \begin{center}
 	$r_1 \rightsquigarrow r_2 \implies (r_1 \backslash c) \rightsquigarrow^* (r_2 \backslash c)$
 \end{center}
-And finally, if two terms are rewritable to each other,
+And also, if two terms are rewritable to each other,
 then they produce the same bitcodes:
 \begin{center}
 	$r \rightsquigarrow^* r' \;\; \textit{then} \; \; \bmkeps \; r = \bmkeps \; r'$
 \end{center}
 The decoding phase of both $\blexer$ and $\blexersimp$
-are the same, which means that if they get the same
+are the same, which means that if they receive the same
 bitcodes before the decoding phase,
-they get the same value after decoding is done.
+they generate the same value after decoding is done.
 We will prove the three properties
 we mentioned above in the next sub-section.
 \subsection{Important Properties of $\rightsquigarrow$}
 First we prove some basic facts
 about $\rightsquigarrow$, $\stackrel{s}{\rightsquigarrow}$,
 	which can be indutively proven by the inductive cases of $\rightsquigarrow$ and
 	$\stackrel{s}{\rightsquigarrow}$.
 \end{proof}
 \noindent
 The inference rules of $\stackrel{s}{\rightsquigarrow}$
-are defined in terms of list cons operation, here
+are defined in terms of the list cons operation, where
 we establish that the
 $\stackrel{s}{\rightsquigarrow}$ and $\stackrel{s*}{\rightsquigarrow}$
 relation is also preserved w.r.t appending and prepending of a list.
 In addition, we
 also prove some relations
 		\item
 			$rs_3 \stackrel{s*}{\rightsquigarrow} rs_4 \land r_1 \rightsquigarrow^* r_2 \implies
 			r_2 :: rs_3 \stackrel{s*}{\rightsquigarrow} r_2 :: rs_4$
 		\item
-			If we could rewrite a regular expression
+			If we can rewrite a regular expression
 			in many steps to $\ZERO$, then
-			we could also rewrite any sequence containing it to $\ZERO$:\\
+			we can also rewrite any sequence containing it to $\ZERO$:\\
 			$r_1 \rightsquigarrow^* \ZERO
 			\implies _{bs}r_1\cdot r_2 \rightsquigarrow^* \ZERO$
 	\end{itemize}
 \end{lemma}
 \begin{proof}
 	$\stackrel{s*}{\rightsquigarrow}$ and using part one to three.
 	The fifth part is a corollary of part four.
 	The last part is proven by rule induction again on $\rightsquigarrow^*$.
 \end{proof}
 \noindent
-Now we are ready to give the proofs of the below properties:
+Now we are ready to give the proofs of the following properties:
 \begin{itemize}
 	\item
-		$(r \rightsquigarrow^* r'\land \bnullable \; r_1)
+		$r \rightsquigarrow^* r'\land \bnullable \; r_1
 		\implies \bmkeps \; r = \bmkeps \; r'$. \\
 	\item
 		$r \rightsquigarrow^* \textit{bsimp} \;r$.\\
 	\item
 		$r \rightsquigarrow r' \implies r \backslash c \rightsquigarrow^* r'\backslash c$.\\
 \end{itemize}
-These properties would work together towards the correctness theorem.
-\subsubsection{Property 1: $(r \rightsquigarrow^* r'\land \bnullable \; r_1)
+\subsubsection{Property 1: $r \rightsquigarrow^* r'\land \bnullable \; r_1
 		\implies \bmkeps \; r = \bmkeps \; r'$}
 Intuitively, this property says we can
 extract the same bitcodes using $\bmkeps$ from the nullable
 components of two regular expressions $r$ and $r'$,
 if we can rewrite from one to the other in finitely
 having at least one nullable regular expressions:
 \begin{center}
 	$\textit{bnullables} \; rs \quad \dn \quad \exists r \in rs. \;\; \bnullable \; r$
 \end{center}
 \noindent
-The rewriting relation $\rightsquigarrow$ preserves nullability:
+The rewriting relation $\rightsquigarrow$ preserves (b)nullability:
 \begin{lemma}\label{rewritesBnullable}
 	\hspace{0em}
 	\begin{itemize}
 		\item
 			$\text{If} \; r_1 \rightsquigarrow r_2, \;
 \end{lemma}
 \begin{proof}
 	By rule induction over the cases that lead to $r_1 \rightsquigarrow r_2$.
 \end{proof}
 \noindent
-With lemma \ref{rewriteBmkepsAux} we are ready to prove its
+With lemma \ref{rewriteBmkepsAux} in place we are ready to prove its
 many-step version:
 \begin{lemma}
 	$\text{If} \;\; r \stackrel{*}{\rightsquigarrow} r' \;\; \text{and} \;\; \bnullable \; r, \;\;\; \text{then} \;\; \bmkeps \; r = \bmkeps \; r'$
 \end{lemma}
 \begin{proof}
-	By rule induction of $\stackrel{*}{\rightsquigarrow} $.
+	By rule induction of $\stackrel{*}{\rightsquigarrow} $. Lemma
-	$\ref{rewritesBnullable}$ tells us both $r$ and $r'$ are nullable.
+	$\ref{rewritesBnullable}$ gives us both $r$ and $r'$ are nullable.
-	\ref{rewriteBmkepsAux} solves the inductive case.
+	The lemma \ref{rewriteBmkepsAux} solves the inductive case.
 \end{proof}
-\subsubsection{Property 2: $r \stackrel{*}{\rightsquigarrow} \bsimp{r}$}
+\subsubsection{Property 2: $r \stackrel{*}{\rightsquigarrow} \textit{bsimp} \; r$}
-Now we get to the ``meaty'' part of the proof,
+Now we get to the key part of the proof,
 which says that our simplification's helper functions
-such as $\distinctBy$ and $\flts$ conform to
+such as $\distinctBy$ and $\flts$ describe
-the $\stackrel{s*}{\rightsquigarrow}$ and
+reducts of $\stackrel{s*}{\rightsquigarrow}$ and
-$\rightsquigarrow^* $ rewriting relations.\\
+$\rightsquigarrow^* $.\\
 The first lemma to prove is a more general version of
 $rs_ 1 \rightsquigarrow^* \distinctBy \; rs_1 \; \phi$:
 \begin{lemma}
 	$rs_1 @ rs_2 \stackrel{s*}{\rightsquigarrow}
 	(rs_1 @ (\distinctBy \; rs_2 \; \; \rerases \;\; (\map\;\; \rerases \; \; rs_1)))$
 we get the corollary
 \begin{corollary}\label{dBPreserves}
 	$rs_1 \stackrel{s*}{\rightsquigarrow} \distinctBy \; rs_1 \; \phi$.
 \end{corollary}
 \noindent
-The flatten function $\flts$ conforms to
+Similarly the flatten function $\flts$ describes a reduct of
 $\stackrel{s*}{\rightsquigarrow}$ as well:
 \begin{lemma}\label{fltsPreserves}
 	$rs \stackrel{s*}{\rightsquigarrow} \flts \; rs$
 \end{lemma}
 \begin{lemma}\label{bsimpaltsPreserves}
 	$_{bs} \sum rs \stackrel{*}{\rightsquigarrow} \bsimpalts \; _{bs} \; rs$
 \end{lemma}
 \noindent
 The simplification function
-$\textit{bsimp}$ only transforms the regex $r$ using steps specified by
+$\textit{bsimp}$ only transforms the regular expression  using steps specified by
-$\rightsquigarrow^*$ and nothing else.
+$\rightsquigarrow^*$ and nothing else:
 \begin{lemma}
-	$r \stackrel{*}{\rightsquigarrow} \bsimp{r}$
+	$r \stackrel{*}{\rightsquigarrow} \textit{bsimp} \; r$
 \end{lemma}
 \begin{proof}
 	By an induction on $r$.
-	The most involved case would be the alternative,
+	The most involved case is the alternative,
 	where we use lemmas \ref{bsimpaltsPreserves},
 	\ref{fltsPreserves} and \ref{dBPreserves} to do a series of rewriting:\\
 	\begin{center}
 		\begin{tabular}{lcl}
 			$rs$ &  $\stackrel{s*}{\rightsquigarrow}$ & $ \map \; \textit{bsimp} \; rs$\\
 			     &  $\stackrel{s*}{\rightsquigarrow}$ & $ \flts \; (\map \; \textit{bsimp} \; rs)$\\
 			     &  $\stackrel{s*}{\rightsquigarrow}$ & $ \distinctBy \;
 			(\flts \; (\map \; \textit{bsimp}\; rs)) \; \rerases \; \phi$\\
 		\end{tabular}
 	\end{center}
-	Using this we derive the following rewrite relation:\\
+	Using this we can derive the following rewrite sequence:\\
 	\begin{center}
 		\begin{tabular}{lcl}
 			$r$ & $=$ & $_{bs}\sum rs$\\[1.5ex]
 			    & $\rightsquigarrow^*$ & $\bsimpalts \; bs \; rs$ \\[1.5ex]
 			    & $\rightsquigarrow^*$ & $\ldots$ \\ [1.5ex]
 			    & $\rightsquigarrow^*$ & $\textit{bsimp} \; r$\\[1.5ex]
 		\end{tabular}
 	\end{center}
 \end{proof}
 \subsubsection{Property 3: $r_1 \stackrel{*}{\rightsquigarrow}  r_2 \implies r_1 \backslash c \stackrel{*}{\rightsquigarrow} r_2 \backslash c$}
-The rewritability relation
+The rewrite relation
-$\rightsquigarrow$ is preserved under derivatives--
+$\rightsquigarrow$ changes into $\stackrel{*}{\rightsquigarrow}$
-it is just that we might need multiple steps
+after derivatives are taken on both sides:
-where originally only one step was needed:
 \begin{lemma}\label{rewriteBder}
 	\hspace{0em}
 	\begin{itemize}
 		\item
 			If $r_1 \rightsquigarrow r_2$, then $r_1 \backslash c
 \begin{corollary}\label{rewritesBder}
 	$r_1 \rightsquigarrow^* r_2 \implies r_1 \backslash c \rightsquigarrow^*
 	r_2 \backslash c$
 \end{corollary}
 \begin{proof}
-	By rule induction of $\stackrel{*}{\rightsquigarrow} $ and using the previous lemma \ref{rewriteBder}.
+	By rule induction of $\stackrel{*}{\rightsquigarrow} $ and   lemma \ref{rewriteBder}.
 \end{proof}
 \noindent
 This can be extended and combined with $r \rightsquigarrow^* \textit{bsimp} \; r$
-to obtain the rewritability between
+to obtain the correspondence between
 $\blexer$ and $\blexersimp$'s intermediate
 derivative regular expressions
 \begin{lemma}\label{bderBderssimp}
 	$a \backslash s \rightsquigarrow^* \bderssimp{a}{s} $
 \end{lemma}
 \begin{proof}
 	By an induction on $s$.
 \end{proof}
 \subsection{Main Theorem}
-Now with \ref{bderBderssimp} we are ready for the main theorem.
+Now with \ref{bderBderssimp} in place we are ready for the main theorem.
 \begin{theorem}
 	$\blexer \; r \; s = \blexersimp{r}{s}$
 \end{theorem}
 \noindent
 \begin{proof}
-	One can rewrite in many steps from the original lexer's
+	We can rewrite in many steps from the original lexer's
 	derivative regular expressions to the
 	lexer with simplification applied (by lemma \ref{bderBderssimp}):
 	\begin{center}
 		$a \backslash s \stackrel{*}{\rightsquigarrow} \bderssimp{a}{s} $.
 	\end{center}
-	we know that they give out the same bits, if the lexing result is a match:
+	We know that they generate the same bits, if the lexing result is a match:
 	\begin{center}
 		$\bnullable \; (a \backslash s)
 		\implies \bmkeps \; (a \backslash s) = \bmkeps \; (\bderssimp{a}{s})$
 	\end{center}
-	Now that they give out the same bits, we know that they give the same value after decoding.
+	Now that they generate the same bits, we know that they give the same value after decoding.
 	\begin{center}
 		$\bnullable \; (a \backslash s)
 		\implies \decode \; r \; (\bmkeps \; (a \backslash s)) =
 		\decode \; r \; (\bmkeps \; (\bderssimp{a}{s}))$
 	\end{center}
-	Which is equivalent to our proof goal:
+	Which is required by our proof goal:
 	\begin{center}
 		$\blexer \; r \; s = \blexersimp \; r \; s$.
 	\end{center}
 \end{proof}
 \noindent
 As a corollary,
-we link this result with the lemma we proved earlier that
+we can link this result with the lemma we proved earlier that
 \begin{center}
-	$(r, s) \rightarrow v \;\; \textit{iff}\;\; \blexer \; r \; s = v$
+	$(r, s) \rightarrow v \;\; \textit{iff}\;\; \blexer \; r \; s = \Some \;v$\\
+	$\nexists v. \; (r, s) \rightarrow v \;\; \textit{iff} \;\; \blexer\;
+	r\;s = \None$.
 \end{center}
 and obtain the corollary that the bit-coded lexer with simplification is
 indeed correctly outputting POSIX lexing result, if such a result exists.
 \begin{corollary}
-	$(r, s) \rightarrow v \;\; \textit{iff} \;\; \blexersimp \; r\; s $
+	$(r, s) \rightarrow v \;\; \textit{iff} \;\; \blexersimp \; r\; s = \Some \; v$\\
+	$\nexists v. \; (r, s) \rightarrow v \;\; \textit{iff} \;\; \blexersimp\;
+	r\;s = \None$.
 \end{corollary}
 \subsection{Comments on the Proof Techniques Used}
 Straightforward and simple as the proof may seem,
-the efforts we spent obtaining it were far from trivial.\\
+the efforts we spent obtaining it were far from trivial.
 We initially attempted to re-use the argument
 in \cref{flex_retrieve}.
-The problem was that both functions $\inj$ and $\retrieve$ require
+The problem is that both functions $\inj$ and $\retrieve$ require
 that the annotated regular expressions stay unsimplified,
 so that one can
 correctly compare $v_{i+1}$ and $r_i$  and $v_i$
 in diagram \ref{graph:inj} and
 ``fit the key into the lock hole''.
 final,
 one could come up new rules
 such as
 $\SEQ r_1 \cdot (\SEQ r_1 \cdot r_3) \rightarrow
 \SEQs [r_1, r_2, r_3]$.
-This does not fit with the proof technique
+However this does not fit with the proof technique
 of our main theorem, but seem to not violate the POSIX
-property.\\
+property.
-Having correctness property is good.
-But we would also a guarantee that the lexer is not slow in
+Having established the correctness of our
-some sense, for exampe, not grinding to a halt regardless of the input.
+$\blexersimp$, in the next chapter we shall prove that with our $\simp$ function,
-As we have already seen, Sulzmann and Lu's simplification function
+for a given $r$, the derivative size is always
-$\simpsulz$ cannot achieve this, because their claim that
-the regular expression size does not grow arbitrary large
-was not true.
-In the next chapter we shall prove that with our $\simp$,
-for a given $r$, the internal derivative size is always
 finitely bounded by a constant.

changeset 624	8ffa28fce271
parent 601	ce4e5151a836
child 639	80cc6dc4c98b