ChengsongTanPhdThesis/Chapters/Bitcoded2.tex
changeset 543 b2bea5968b89
parent 539 7cf9f17aa179
child 576 3e1b699696b6


Now we introduce the simplifications, which are the reason why we introduced
the bitcodes in the first place.

\section{Simplification for Annotated Regular Expressions}

The first thing we notice in the fast growth of the derivatives of examples such as
$(a^*a^*)^*$ and $(a^* + (aa)^*)^*$ is that a lot of duplicated sub-patterns
are scattered around different levels:

\begin{center}
	$(a^*a^*)^* \rightarrow (a^*a^* + a^*)\cdot(a^*a^*)^* \rightarrow$\\
	$((a^*a^* + a^*) + a^*)\cdot(a^*a^*)^* + (a^*a^* + a^*)\cdot(a^*a^*)^*$
\end{center}
\noindent
This happens despite the fact that we have already implemented simple-minded simplifications
such as throwing away useless $\ONE$s and $\ZERO$s.
The simplification rule $r + r \rightarrow r$ cannot make a difference either,
since it only removes duplicates on the same level, not duplicates in nested
alternatives as in $(r+a)+r \rightarrow r+a$.
This requires us to break up nested alternatives into lists, for example
using a flatten operation similar to the one defined for the list datatype
of functional languages:

 \begin{center}
  \begin{tabular}{@{}lcl@{}}
  $\textit{flts} \; ((_{bs}\sum \textit{as}) :: \textit{as'})$ & $\dn$ & $(\textit{map} \;
     (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flts} \; \textit{as'} $ \\
  $\textit{flts} \; (\ZERO :: \textit{as'})$ & $\dn$ & $ \textit{flts} \;  \textit{as'} $ \\
  $\textit{flts} \; (a :: \textit{as'})$ & $\dn$ & $a :: \textit{flts} \; \textit{as'}$ \quad(otherwise)
\end{tabular}
\end{center}

\noindent
There is a minor difference though: unlike the usual flatten function, our $\flts$
also throws away $\ZERO$s.
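
\noindent
To make the definition more concrete, below is a small Scala sketch of $\flts$. The
constructor names (\texttt{AZERO}, \texttt{AALTS}) and the function \texttt{fuse} are
assumptions about our implementation of annotated regular expressions and are only
illustrative here:
\begin{verbatim}
// a sketch of flts: opens up nested alternatives and drops ZEROs;
// fuse(bs, a) is assumed to prepend the bit-sequence bs to the
// top-most annotation of the annotated regular expression a
def flts(as: List[ARexp]): List[ARexp] = as match {
  case Nil => Nil
  case AZERO :: rest => flts(rest)
  case AALTS(bs, as1) :: rest => as1.map(fuse(bs, _)) ::: flts(rest)
  case a :: rest => a :: flts(rest)
}
\end{verbatim}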
       
For a flattened list of regular expressions, a de-duplication can be done efficiently:

\begin{center}
	\begin{tabular}{@{}lcl@{}}
		$\distinctBy \; [] \; f\; acc $ & $\dn$ & $ []$\\
		$\distinctBy \; (x :: xs) \; f \; acc$ & $\dn$ & \\
						       & & $\quad \textit{if} \; (f \; x) \in acc \; \textit{then} \; \distinctBy \; xs \; f \; acc$\\
						       & & $\quad \textit{else} \; x :: (\distinctBy \; xs \; f \; (\{f \; x\} \cup acc))$
	\end{tabular}
\end{center}
\noindent
The reason we define a de-duplication function parameterised by a mapping $f$ is
that we want to eliminate regular expressions that are syntactically the same
but carry different bit-codes (more on why we can do this later).
To achieve this, we pass $\erase$ as the function $f$ to the de-duplication
operation.
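
\noindent
For reference, a Scala sketch of $\distinctBy$ is given below; the names are ours and
only illustrative, and we use a \texttt{Set} for the accumulator as in the mathematical
definition above:
\begin{verbatim}
// keeps the first copy of each element, where two elements count as
// equal when they are mapped to the same value by f (for us: erase)
def distinctBy[B, C](xs: List[B], f: B => C, acc: Set[C] = Set()): List[B] =
  xs match {
    case Nil => Nil
    case x :: rest =>
      if (acc.contains(f(x))) distinctBy(rest, f, acc)
      else x :: distinctBy(rest, f, acc + f(x))
  }
\end{verbatim}
\noindent
A typical call is then \texttt{distinctBy(rs, erase)}, which removes all but the first
of any regular expressions in \texttt{rs} that are equal after erasing their bit-codes.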
       
A recursive definition of our simplification function
that looks somewhat similar to our Scala code is given below:

\begin{center}
  \begin{tabular}{@{}lcl@{}}
	  $\textit{bsimp} \; (_{bs}a_1\cdot a_2)$ & $\dn$ & $ \textit{bsimp}_{ASEQ} \; bs \;(\textit{bsimp} \; a_1) \; (\textit{bsimp}  \; a_2)  $ \\
	  $\textit{bsimp} \; (_{bs}\sum \textit{as})$ & $\dn$ & $\textit{bsimp}_{ALTS} \; \textit{bs} \; (\textit{distinctBy} \; ( \flts ( \textit{map} \; \textit{bsimp} \; as)) \; \erase \; \phi) $ \\
	  $\textit{bsimp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$
\end{tabular}
\end{center}

\noindent
The simplification function (named $\bsimp$ for \emph{b}it-coded) performs a pattern match on the regular expression.
When it detects that the regular expression is an alternative or a
sequence, it first simplifies its children regular expressions
recursively and then checks whether one of the children has turned into $\ZERO$ or
$\ONE$, which might trigger further simplification at the current level.
Current-level simplifications for sequences are handled by the function $\textit{bsimp}_{ASEQ}$,
using rules such as $\ZERO \cdot r \rightarrow \ZERO$ and $\ONE \cdot r \rightarrow r$:
\begin{center}
	\begin{tabular}{@{}lcl@{}}
		$\textit{bsimp}_{ASEQ} \; bs\; a \; b$ & $\dn$ & $ (a,\; b) \;\textit{match}$\\
   &&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow  \ZERO$ \\
   &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow  \ZERO$ \\
   &&$\quad\textit{case} \;  (_{bs_1}\ONE, a_2') \Rightarrow  \textit{fuse} \; (bs@bs_1) \;  a_2'$ \\
   &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow   _{bs}a_1' \cdot a_2'$
	\end{tabular}
\end{center}
\noindent
The most involved part is the $\sum$ clause, where we first call $\flts$ on
the simplified children regular expression list $\textit{map}\; \textit{bsimp}\; \textit{as}$
and then call $\distinctBy$ on that list; the predicate determining whether two
elements are the same is $\erase \; r_1 = \erase\; r_2$.
Finally, depending on whether the regular expression list $as'$ has turned into a
singleton or empty list after $\flts$ and $\distinctBy$, $\textit{bsimp}_{ALTS}$
decides whether to keep the current level constructor $\sum$ as it is, and
removes it when there are fewer than two elements:
\begin{center}
	\begin{tabular}{lcl}
		$\textit{bsimp}_{ALTS} \; bs \; as'$ & $ \dn$ & $ as' \; \textit{match}$\\
   &&$\quad\textit{case} \; [] \Rightarrow  \ZERO$ \\
   &&$\quad\textit{case} \; a :: [] \Rightarrow  \textit{fuse} \; bs \; a$ \\
   &&$\quad\textit{case} \;  as' \Rightarrow _{bs}\sum \textit{as'}$\\
	\end{tabular}
\end{center}
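
\noindent
The three functions above correspond to the following Scala sketch (again, constructor
names such as \texttt{ASEQ}, \texttt{AALTS}, \texttt{AONE}, \texttt{AZERO}, the type of
bit-codes \texttt{Bits} and the function \texttt{fuse} are assumptions about our
implementation and may differ in detail):
\begin{verbatim}
def bsimpASEQ(bs: Bits, a1: ARexp, a2: ARexp): ARexp = (a1, a2) match {
  case (AZERO, _)       => AZERO                 // 0 . r  ->  0
  case (_, AZERO)       => AZERO                 // r . 0  ->  0
  case (AONE(bs1), a2s) => fuse(bs ++ bs1, a2s)  // 1 . r  ->  r (bits fused)
  case (a1s, a2s)       => ASEQ(bs, a1s, a2s)
}

def bsimpALTS(bs: Bits, as: List[ARexp]): ARexp = as match {
  case Nil      => AZERO                         // empty alternative
  case a :: Nil => fuse(bs, a)                   // singleton alternative
  case _        => AALTS(bs, as)
}

def bsimp(a: ARexp): ARexp = a match {
  case ASEQ(bs, a1, a2) => bsimpASEQ(bs, bsimp(a1), bsimp(a2))
  case AALTS(bs, as)    => bsimpALTS(bs, distinctBy(flts(as.map(bsimp)), erase))
  case _                => a
}
\end{verbatim}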

Having defined the $\bsimp$ function,
we add it as a phase after a derivative is taken;
in this way the derivative stays small:
\begin{center}
	\begin{tabular}{lcl}
		$r \backslash_{bsimp} c$ & $\dn$ & $\textit{bsimp}(r \backslash c)$
	\end{tabular}
\end{center}
Following the previous notation of the natural
extension from the derivative w.r.t.~a character to the derivative
w.r.t.~a string, we define the derivative that interleaves simplifications with derivatives:

\begin{center}
\begin{tabular}{lcl}
$r \backslash_{bsimps} (c\!::\!s) $ & $\dn$ & $(r \backslash_{bsimp}\, c) \backslash_{bsimps}\, s$ \\
$r \backslash_{bsimps} [\,] $ & $\dn$ & $r$
\end{tabular}
\end{center}

\noindent
Extracting bit-codes from the derivatives that have been simplified repeatedly after
each derivative step, the simplified $\blexer$, called $\blexersimp$,
is defined as
 \begin{center}
\begin{tabular}{lcl}
  $\textit{blexer\_simp}\;r\,s$ & $\dn$ &
      $\textit{let}\;a = (r^\uparrow)\backslash_{bsimps}\, s\;\textit{in}$\\
  & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
  & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
  & & $\;\;\textit{else}\;\textit{None}$
\end{tabular}
\end{center}
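
\noindent
As a rough Scala sketch, the simplified lexer then looks as follows; \texttt{bder},
\texttt{bnullable}, \texttt{bmkeps}, \texttt{decode} and \texttt{internalise} are assumed
to be the functions of the unsimplified bit-coded lexer from the previous chapter, and the
branches are assumed to mirror those of $\blexer$:
\begin{verbatim}
// derivative w.r.t. a string, simplifying after every character
def bdersSimp(a: ARexp, s: List[Char]): ARexp = s match {
  case Nil     => a
  case c :: cs => bdersSimp(bsimp(bder(c, a)), cs)
}

def blexerSimp(r: Rexp, s: String): Option[Val] = {
  val a = bdersSimp(internalise(r), s.toList)
  if (bnullable(a)) Some(decode(bmkeps(a), r)) else None
}
\end{verbatim}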

\noindent
This algorithm keeps the regular expression size small: for example,
with this simplification our previous $(a + aa)^*$ example's fast growth (over
$10^5$ nodes at an input length of around $20$)
is reduced to just 17 nodes, and the size stays constant no matter how long the
input string is.
We show some graphs to better demonstrate this improvement.


\section{$(a+aa)^*$ and $(a^*\cdot a^*)^*$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$ After Simplification}
For $(a+aa)^*$, the derivatives used to grow to over $9000$ nodes with the simple-minded simplification
at an input length of only around $15$ characters:
\begin{center}
\begin{tabular}{ll}
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    ylabel={derivative size},
        width=7cm,
    height=4cm, 
    legend entries={Lexer with $\textit{bsimp}$},  
    legend pos=  south east,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {a_aa_star_bsimp.data};
\end{axis}
\end{tikzpicture} %\label{fig:BitcodedLexer}
&
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    ylabel={derivative size},
    width = 7cm,
    height = 4cm,
    legend entries={Lexer without $\textit{bsimp}$},  
    legend pos=  north west,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {a_aa_star_easysimp.data};
\end{axis}
\end{tikzpicture} 
\end{tabular}
\end{center}
And for $(a^*\cdot a^*)^*$, unlike in Figure~\ref{fig:BetterWaterloo}, the size with simplification now stays nicely constant with the current
simplification rules (we put the graphs together to show the contrast):
\begin{center}
\begin{tabular}{ll}
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    ylabel={derivative size},
        width=7cm,
    height=4cm, 
    legend entries={Lexer with $\textit{bsimp}$},  
    legend pos=  south east,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {BitcodedLexer.data};
\end{axis}
\end{tikzpicture} %\label{fig:BitcodedLexer}
&
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    ylabel={derivative size},
    width = 7cm,
    height = 4cm,
    legend entries={Lexer without $\textit{bsimp}$},  
    legend pos=  north west,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {BetterWaterloo.data};
\end{axis}
\end{tikzpicture} 
\end{tabular}
\end{center}

%----------------------------------------------------------------------------------------
%	SECTION rewrite relation
%----------------------------------------------------------------------------------------
\section{The Rewriting Relation $\rrewrite$ ($\rightsquigarrow$)}
The overall idea for proving the correctness conjecture
\begin{conjecture}
	$\blexersimp \; r \; s = \blexer \; r\; s$
\end{conjecture}
\noindent
is that $\textit{bsimp}$ does not change the regular expressions so much that
it becomes impossible to extract the $\POSIX$ values.
To capture this ``similarity'' between unsimplified regular expressions and simplified ones,
we devise the rewriting relation $\rrewrite$, written infix as $\rightsquigarrow$:

   170 \end{equation}
   231 \begin{center}
   171 This property means we do not have to repeatedly
   232 \begin{mathpar}
   172 apply simplification in each step, which justifies
   233 	\inferrule{}{_{bs} \ZERO \cdot r_2 \rightsquigarrow \ZERO\\}
   173 our definition of $\blexersimp$.
   234 
   174 It will also be useful in future proofs where properties such as 
   235 	\inferrule{}{_{bs} r_1 \cdot \ZERO \rightsquigarrow \ZERO\\}
   175 closed forms are needed.
   236 
   176 The proof is by structural induction on $r$.
   237 	\inferrule{}{_{bs1} ((_{bs2} \ONE) \cdot r) \rightsquigarrow \fuse \; (bs_1 @ bs_2) \; r\\}\\
   177 
   238 
   178 %-----------------------------------
   239 	
   179 %	SUBSECTION 
   240 	
   180 %-----------------------------------
   241 	\inferrule{\\ r_1 \rightsquigarrow r_2}{_{bs} r_1 \cdot r_3 \rightsquigarrow _{bs} r_2 \cdot r_3\\}
   181 \subsection{Syntactic Equivalence Under $\simp$}
   242 
   182 We prove that minor differences can be annhilated
   243 	\inferrule{\\ r_3 \rightsquigarrow r_4}{_{bs} r_1 \cdot r_3 \rightsquigarrow _{bs} r_1 \cdot r_4\\}\\
   183 by $\simp$.
   244 
   184 For example,
   245 	\inferrule{}{ _{bs}\sum [] \rightsquigarrow \ZERO\\}
   185 \begin{center}
   246 
   186 $\simp \;(\simpALTs\; (\map \;(\_\backslash \; x)\; (\distinct \; \mathit{rs}\; \phi))) = 
   247 	\inferrule{}{ _{bs}\sum [a] \rightsquigarrow \fuse \; bs \; a\\}
   187  \simp \;(\simpALTs \;(\distinct \;(\map \;(\_ \backslash\; x) \; \mathit{rs}) \; \phi))$
   248 
   188 \end{center}
   249 	\inferrule{\\ rs_1 \stackrel{s}{\rightsquigarrow} rs_2}{_{bs}\sum rs_1 \rightsquigarrow rs_2\\}
   189 
   250 
   190 
   251 	\inferrule{}{\\ [] \stackrel{s}{\rightsquigarrow} []}
   191 
   252 
   192 
   253 	\inferrule{rs_1 \stackrel{s}{\rightsquigarrow} rs_2}{\\ r :: rs_1 \rightsquigarrow r :: rs_2 \rightsquigarrow}
   193 
   254 
   194 
   255 	\inferrule{r_1 \rightsquigarrow r_2}{ r_1 :: rs \stackrel{s}{\rightsquigarrow} r_2 :: rs\\}
   195 
   256 
   196 %----------------------------------------------------------------------------------------
   257 	\inferrule{}{\ZERO :: rs \stackrel{s}{\rightsquigarrow} rs}
   197 %	SECTION corretness proof
   258 
   198 %----------------------------------------------------------------------------------------
   259 	\inferrule{}{_{bs} \sum (rs_1 :: rs_b) \stackrel{s}{\rightsquigarrow} ((\map \; (\fuse \; bs_1) \; rs_1) @ rsb) }
   199 \section{Proof Technique of Correctness of Bit-coded Algorithm with Simplification}
   260 
       
   261 	\inferrule{\\ \rerase{a_1} = \rerase{a_2}}{rs_a @ [a_1] @ rs_b @ [a_2] @ rsc \stackrel{s}{\rightsquigarrow} rs_a @ [a_1] @ rs_b @ rs_c}
       
   262 
       
   263 \end{mathpar}
       
   264 \end{center}
       
These ``rewrite'' steps define the atomic simplifications we can impose on regular expressions
under our simplification algorithm.
For convenience, we define a relation $\stackrel{s}{\rightsquigarrow}$ for rewriting
a list of regular expressions to another list.
The $\rerase{}$ function is used instead of $_\downarrow$ for the finiteness bound proof of the next chapter,
which we will discuss later. For the moment the reader can assume it does basically the same thing as $\erase$.

Usually more than one rewrite step takes place during the simplification of a regular expression;
therefore we define the reflexive transitive closures of the $\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$
relations:

\begin{center}
\begin{mathpar}
	\inferrule{}{ r \stackrel{*}{\rightsquigarrow} r \\}

	\inferrule{}{\\ rs \stackrel{s*}{\rightsquigarrow} rs \\}

	\inferrule{\\r_1 \stackrel{*}{\rightsquigarrow}  r_2 \quad r_2 \rightsquigarrow r_3}{r_1 \stackrel{*}{\rightsquigarrow} r_3\\}

	\inferrule{\\rs_1 \stackrel{s*}{\rightsquigarrow}  rs_2 \quad rs_2 \stackrel{s}{\rightsquigarrow} rs_3}{rs_1 \stackrel{s*}{\rightsquigarrow} rs_3}
\end{mathpar}
\end{center}
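
\noindent
To see how these rules interact, consider the following small, made-up example, where
$_{bs_1}c$ stands for a character regular expression annotated with the bit-code $bs_1$:
\begin{center}
	$_{bs}\sum \, [\ZERO,\; _{bs_1}c,\; _{bs_2}c]
	\;\rightsquigarrow\; _{bs}\sum \, [_{bs_1}c,\; _{bs_2}c]
	\;\rightsquigarrow\; _{bs}\sum \, [_{bs_1}c]
	\;\rightsquigarrow\; \fuse \; bs \; (_{bs_1}c)$
\end{center}
\noindent
The first two steps remove the $\ZERO$ and the later duplicate $_{bs_2}c$ (whose erased
form equals that of $_{bs_1}c$) with the list rules, lifted to the alternative by the
congruence rule for $\sum$; the last step removes the now-redundant $\sum$ constructor.
Chaining these steps gives an instance of $\stackrel{*}{\rightsquigarrow}$.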
       
Now that we have modelled the rewriting behaviour of our simplifier $\bsimp$, we prove mainly
three properties about how these relations connect to $\blexersimp$:
\begin{itemize}
	\item
		$r \stackrel{*}{\rightsquigarrow} \bsimp{r}$:
		the algorithm $\bsimp$ only transforms the regular expression $r$ using steps specified by
		$\stackrel{*}{\rightsquigarrow}$ and nothing else.
	\item
		$r \stackrel{*}{\rightsquigarrow} r' \implies r \backslash c \stackrel{*}{\rightsquigarrow} r'\backslash c$:
		the relation $\stackrel{*}{\rightsquigarrow}$ is preserved under derivatives.
	\item
		$(r \stackrel{*}{\rightsquigarrow} r'\land \bnullable \; r) \implies \bmkeps \; r = \bmkeps \; r'$: if we can reach another
		expression in finitely many atomic simplification steps, then the two regular expressions
		produce the same bit-codes under the bit collection function $\bmkeps$ used by our $\blexer$.

\end{itemize}
       
\section{Three Important Properties}
These properties work together towards the correctness theorem.
We start proving each of these lemmas below.
\subsection{$(r \stackrel{*}{\rightsquigarrow} r'\land \bnullable \; r) \implies \bmkeps \; r = \bmkeps \; r'$ and $r \stackrel{*}{\rightsquigarrow} \bsimp{r}$}
The first few properties we establish are that the inference rules we gave for $\rightsquigarrow$
and $\stackrel{s}{\rightsquigarrow}$ also hold as implications for $\stackrel{*}{\rightsquigarrow}$ and
$\stackrel{s*}{\rightsquigarrow}$.
       
\begin{lemma}
	$rs_1 \stackrel{s*}{\rightsquigarrow} rs_2 \implies _{bs} \sum rs_1 \stackrel{*}{\rightsquigarrow} _{bs} \sum rs_2$
\end{lemma}
\begin{proof}
	By rule induction of $\stackrel{s*}{\rightsquigarrow}$.
\end{proof}
\begin{lemma}
	$r \stackrel{*}{\rightsquigarrow} r' \implies _{bs} \sum (r :: rs) \stackrel{*}{\rightsquigarrow}  _{bs} \sum (r' :: rs)$
\end{lemma}
\begin{proof}
	By rule induction of $\stackrel{*}{\rightsquigarrow}$.
\end{proof}
\noindent
Then we establish that the $\stackrel{s}{\rightsquigarrow}$ and $\stackrel{s*}{\rightsquigarrow}$ relations are preserved w.r.t.\ appending and prepending of a list:
\begin{lemma}
	$rs_1 \stackrel{s}{\rightsquigarrow} rs_2 \implies rs @ rs_1 \stackrel{s}{\rightsquigarrow} rs @ rs_2$
\end{lemma}
\begin{proof}
	By induction on the list $rs$.
\end{proof}

\begin{lemma}
	$rs_1 \stackrel{s*}{\rightsquigarrow} rs_2 \implies rs @ rs_1 \stackrel{s*}{\rightsquigarrow} rs @ rs_2$
\end{lemma}
\begin{proof}
	By rule induction of $\stackrel{s*}{\rightsquigarrow}$.
\end{proof}

\noindent
The $\stackrel{s}{\rightsquigarrow}$ relation, after appending a list, becomes $\stackrel{s*}{\rightsquigarrow}$:
\begin{lemma}\label{ssgqTossgs}
	$rs_1 \stackrel{s}{\rightsquigarrow} rs_2 \implies rs_1 @ rs \stackrel{s*}{\rightsquigarrow} rs_2 @ rs$
\end{lemma}
\begin{proof}
	By rule induction of $\stackrel{s}{\rightsquigarrow}$.
\end{proof}
\begin{lemma}
	$rs_1 \stackrel{s*}{\rightsquigarrow} rs_2 \implies rs_1 @ rs \stackrel{s*}{\rightsquigarrow} rs_2 @ rs$
\end{lemma}
\begin{proof}
	By rule induction of $\stackrel{s*}{\rightsquigarrow}$ and using \ref{ssgqTossgs}.
\end{proof}
       
Here are two lemmas relating $\stackrel{*}{\rightsquigarrow}$ and $\stackrel{s*}{\rightsquigarrow}$:
\begin{lemma}\label{singleton}
	$r_1 \stackrel{*}{\rightsquigarrow} r_2 \implies [r_1] \stackrel{s*}{\rightsquigarrow} [r_2]$
\end{lemma}
\begin{proof}
	By rule induction of $\stackrel{*}{\rightsquigarrow}$.
\end{proof}
\begin{lemma}
	$rs_3 \stackrel{s*}{\rightsquigarrow} rs_4 \land r_1 \stackrel{*}{\rightsquigarrow}  r_2 \implies
	r_1 :: rs_3 \stackrel{s*}{\rightsquigarrow} r_2 :: rs_4$
\end{lemma}
\begin{proof}
	By using \ref{singleton}.
\end{proof}
Now we get to the ``meaty'' part of the proof, which relates the relations $\stackrel{s*}{\rightsquigarrow}$ and
$\stackrel{*}{\rightsquigarrow}$ with our simplification components such as $\distinctBy$ and $\flts$.
The first lemma below says that for a list made of two parts $rs_1 @ rs_2$, one can throw away the duplicate
elements in $rs_2$, as well as those that have appeared in $rs_1$:
\begin{lemma}
	$rs_1 @ rs_2 \stackrel{s*}{\rightsquigarrow} rs_1 @ (\distinctBy \; rs_2 \; \rerase{\_} \; (\map\; \rerase{\_}\; rs_1))$
\end{lemma}
\begin{proof}
	By induction on $rs_2$, where $rs_1$ is allowed to be arbitrary.
\end{proof}
The above has a corollary that matches the way $\distinctBy$ is actually called in $\bsimp$:
\begin{lemma}\label{dBPreserves}
	$rs_1 \stackrel{s*}{\rightsquigarrow} \distinctBy \; rs_1 \; \rerase{\_} \; \phi$
\end{lemma}

       
The flatten function $\flts$ also stays within the $\stackrel{s*}{\rightsquigarrow}$ relation:
\begin{lemma}\label{fltsPreserves}
	$rs \stackrel{s*}{\rightsquigarrow} \flts \; rs$
\end{lemma}

The rewriting-in-many-steps property is composable in terms of the regular expression constructors:
\begin{lemma}
	$r_1 \stackrel{*}{\rightsquigarrow} r_2 \implies _{bs} r_1 \cdot r_3 \stackrel{*}{\rightsquigarrow} \;  _{bs} r_2 \cdot r_3 \quad $ and
	$r_3 \stackrel{*}{\rightsquigarrow} r_4 \implies _{bs} r_1 \cdot r_3 \stackrel{*}{\rightsquigarrow} \; _{bs} r_1 \cdot r_4$
\end{lemma}

The rewriting-in-many-steps properties $\stackrel{*}{\rightsquigarrow}$ and $\stackrel{s*}{\rightsquigarrow}$ are preserved under the function $\fuse$:
\begin{lemma}
	$r_1 \stackrel{*}{\rightsquigarrow} r_2 \implies \fuse \; bs \; r_1 \stackrel{*}{\rightsquigarrow} \; \fuse \; bs \; r_2 \quad $ and
	$rs_1 \stackrel{s}{\rightsquigarrow} rs_2 \implies \map \; (\fuse \; bs) \; rs_1 \stackrel{s*}{\rightsquigarrow} \map \; (\fuse \; bs) \; rs_2$
\end{lemma}
\begin{proof}
	By the properties $r_1 \rightsquigarrow r_2 \implies \fuse \; bs \; r_1 \rightsquigarrow \fuse \; bs \; r_2$ and
	$rs_2 \stackrel{s}{\rightsquigarrow} rs_3 \implies \map \; (\fuse \; bs) \; rs_2 \stackrel{s*}{\rightsquigarrow} \map \; (\fuse \; bs)\; rs_3$.
\end{proof}
\noindent
If we can rewrite a regular expression in many steps to $\ZERO$, then
we can also rewrite any sequence containing it to $\ZERO$:
\begin{lemma}
	$r_1 \stackrel{*}{\rightsquigarrow} \ZERO \implies _{bs}r_1\cdot r_2 \stackrel{*}{\rightsquigarrow} \ZERO$
\end{lemma}
       
\begin{lemma}
	$\bnullable \; (r \backslash s) \implies \bmkeps \; (r \backslash s) = \bmkeps \; (\bderssimp{r}{s})$
\end{lemma}
       
\noindent
The function $\bsimpalts$ preserves rewritability:
\begin{lemma}\label{bsimpaltsPreserves}
	$_{bs} \sum rs \stackrel{*}{\rightsquigarrow} \bsimpalts \; bs \; rs$
\end{lemma}
\noindent
Before we give the next lemmas, we define a predicate for a list of regular expressions
containing at least one nullable regular expression:
\begin{definition}
	$\textit{bnullables} \; rs \dn  \exists r \in rs.\; \bnullable\; r$
\end{definition}
The rewriting relation $\rightsquigarrow$ preserves nullability:
\begin{lemma}
	$r_1 \rightsquigarrow r_2 \implies  \bnullable \; r_1 = \bnullable \; r_2$ and
	$rs_1 \stackrel{s}{\rightsquigarrow} rs_2 \implies \textit{bnullables} \; rs_1 = \textit{bnullables} \; rs_2$
\end{lemma}
\begin{proof}
	By rule induction of $\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$.
\end{proof}
So does the many-steps rewriting:
\begin{lemma}\label{rewritesBnullable}
	$r_1 \stackrel{*}{\rightsquigarrow}  r_2 \implies \bnullable \; r_1 = \bnullable \; r_2$
\end{lemma}
\begin{proof}
	By rule induction of $\stackrel{*}{\rightsquigarrow}$.
\end{proof}
\noindent
And if both regular expressions in a rewriting relation are nullable, then they
produce the same bit-codes:

\begin{lemma}\label{rewriteBmkepsAux}
	$r_1 \rightsquigarrow r_2 \implies (\bnullable \; r_1 \land \bnullable \; r_2 \implies \bmkeps \; r_1 =
	\bmkeps \; r_2)$ and
	$rs_1 \stackrel{s}{\rightsquigarrow} rs_2 \implies (\bnullables \; rs_1 \land \bnullables \; rs_2 \implies
	\bmkepss \; rs_1 = \bmkepss \; rs_2)$
\end{lemma}
\noindent
The definition of $\bmkepss$ on a list $rs$ just extracts the bit-codes of the first element in $rs$ that
is $\bnullable$:
\begin{center}
	\begin{tabular}{lcl}
		$\bmkepss \; [] $ & $\dn$ & $[]$\\
		$\bmkepss \; (r :: rs)$ & $\dn$ & $\textit{if} \; \bnullable \; r \; \textit{then} \; \bmkeps \; r \; \textit{else} \; \bmkepss \; rs$
	\end{tabular}
\end{center}
       
\noindent
And now we are ready to prove the key property that if we
have two regular expressions, one rewritable in many steps to the other,
and one of them is $\bnullable$, then they both yield the same bits under $\bmkeps$:
\begin{lemma}
	$(r \stackrel{*}{\rightsquigarrow} r'\land \bnullable \; r) \implies \bmkeps \; r = \bmkeps \; r'$
\end{lemma}
\begin{proof}
	By rule induction of $\stackrel{*}{\rightsquigarrow}$, using \ref{rewriteBmkepsAux} and \ref{rewritesBnullable}.
\end{proof}
\noindent
The other property can now also be established:
\begin{lemma}
	$r \stackrel{*}{\rightsquigarrow} \bsimp{r}$
\end{lemma}
\begin{proof}
	By an induction on $r$.
	The most difficult case is the alternative case, where, using properties such as \ref{bsimpaltsPreserves}, \ref{fltsPreserves} and \ref{dBPreserves}, we can continuously rewrite a list as follows:\\
	$rs \stackrel{s*}{\rightsquigarrow} \map \; \textit{bsimp} \; rs$\\
	$\ldots \stackrel{s*}{\rightsquigarrow} \flts \; (\map \; \textit{bsimp} \; rs)$\\
	$\ldots \;\stackrel{s*}{\rightsquigarrow} \distinctBy \; (\flts \; (\map \; \textit{bsimp}\; rs)) \; \rerase{\_} \; \phi$\\
	Then we can do the following many-steps rewrite on the regular expression level:\\
	$ _{bs} \sum (\distinctBy \; (\flts \; (\map \; \textit{bsimp}\; rs)) \; \rerase{\_} \; \phi) \stackrel{*}{\rightsquigarrow} \bsimpalts \; bs \; (\distinctBy \; (\flts \; (\map \; \textit{bsimp}\; rs)) \; \rerase{\_} \; \phi)$
\end{proof}
       
\section{Proof for the Property $r_1 \stackrel{*}{\rightsquigarrow}  r_2 \implies r_1 \backslash c \stackrel{*}{\rightsquigarrow} r_2 \backslash c$}
The first thing we prove is that if we can rewrite in one step, then after taking a derivative
we can rewrite in many steps:
\begin{lemma}\label{rewriteBder}
	$r_1 \rightsquigarrow r_2 \implies r_1 \backslash c \stackrel{*}{\rightsquigarrow}  r_2 \backslash c$ and
	$rs_1 \stackrel{s}{\rightsquigarrow} rs_2 \implies \map \; (\_\backslash c) \; rs_1 \stackrel{s*}{\rightsquigarrow} \map \; (\_ \backslash c) \; rs_2$
\end{lemma}
\begin{proof}
	By induction on $\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$, using a number of the previous lemmas.
\end{proof}
Now we can prove that once we can rewrite from one expression to another in many steps,
then after a derivative on both sides we can still rewrite one to the other in many steps:
\begin{lemma}
	$r_1 \stackrel{*}{\rightsquigarrow}  r_2 \implies r_1 \backslash c \stackrel{*}{\rightsquigarrow}  r_2 \backslash c$
\end{lemma}
\begin{proof}
	By rule induction of $\stackrel{*}{\rightsquigarrow}$ and using the previous lemma \ref{rewriteBder}.
\end{proof}
\noindent
This can be extended and combined with the previous two important properties
so that a regular expression's successive derivatives can be rewritten in many steps
to its simplified counterpart:
\begin{lemma}\label{bderBderssimp}
	$a \backslash s \stackrel{*}{\rightsquigarrow} \bderssimp{a}{s} $
\end{lemma}
       
\subsection{Main Theorem}
Now with \ref{bderBderssimp} we are ready for the main theorem.
To link $\blexersimp$ and $\blexer$,
we first show that they give out the same bits, if the lexing result is a match:
\begin{lemma}
	$\bnullable \; (a \backslash s) \implies \bmkeps \; (a \backslash s) = \bmkeps \; (\bderssimp{a}{s})$
\end{lemma}
\noindent
Since they give out the same bits, we know that they give the same value after decoding,
which we know is the correct value because $\blexer$ is correct:
\begin{theorem}
	$\blexer \; r \; s = \blexersimp \; r \; s$
\end{theorem}

       
\subsection{Comments on the Proof Techniques Used}
The non-trivial part of proving the correctness of the algorithm with simplification,
compared with the one without simplification, is that we can no longer use the argument
in \cref{flex_retrieve}.
The function \retrieve needs the cumbersome structure of the (unsimplified)
annotated regular expression to
agree with the structure of the value, but simplification will always mess with that
structure.

We also tried to prove $\bsimp{\bderssimp{a}{s}} = \bsimp{a\backslash s}$,
but this turns out not to be true. A counterexample is
\[ a = [(\ONE+c)\cdot [aa \cdot (\ONE+c)]] \quad\text{and}\quad s = aa.
\]

Then we would have $\bsimp{a \backslash s}$ being
$_{[]}(_{ZZ}\ONE +  _{ZS}c ) $
whereas $\bsimp{\bderssimp{a}{s}}$ would be
$_{Z}(_{Z} \ONE + _{S} c)$.
Unfortunately, if we apply $\textit{bsimp}$ at different
stages we will always have this discrepancy, because
whether the $\map \; (\fuse\; bs) \; as$ operation in $\textit{bsimp}$
is applied at a given point depends entirely on when the simplification
takes place and on whether there is a larger alternative structure surrounding the
alternative being simplified.
The good thing about $\stackrel{*}{\rightsquigarrow}$ is that it allows
us not to specify how exactly the ``atomic'' simplification steps $\rightsquigarrow$
are taken, but simply to say that they can be taken to make two similar
regular expressions equal, and that they can be interleaved with derivatives.


Having the correctness property is good, but we would also like the lexer to be efficient in
some sense, for example, not grinding to a halt on certain cases.
In the next chapter we shall prove that for a given $r$, the size of its internal derivatives is always
bounded by a constant.