lexing: ChengsongTanPhdThesis/Chapters/Bitcoded2.tex@35df9cdd36ca


% Chapter Template

% Main chapter title
\chapter{Correctness of Bit-coded Algorithm with Simplification}

\label{Bitcoded2} % Change X to a consecutive number; for referencing this chapter elsewhere, use \ref{ChapterX}
%Then we illustrate how the algorithm without bitcodes falls short for such aggressive 
%simplifications and therefore introduce our version of the bitcoded algorithm and 
%its correctness proof in 
%Chapter 3\ref{Chapter3}. 



Now we introduce the simplifications, which is why we introduce the 
bitcodes in the first place.

\section{Simplification for Annotated Regular Expressions}
The first thing we notice in the fast growth of examples such as $(a^*a^*)^*$'s
and $(a^* + (aa)^*)^*$'s derivatives is that a lot of duplicated sub-patterns
are scattered around different levels, and therefore requires 
de-duplication at different levels:
\begin{center}
	$(a^*a^*)^* \rightarrow (a^*a^* + a^*)\cdot(a^*a^*)^* \rightarrow $\\
	$((a^*a^* + a^*) + a^*)\cdot(a^*a^*)^* + (a^*a^* + a^*)\cdot(a^*a^*)^*$
\end{center}
\noindent
As we have already mentioned in \ref{eqn:growth2},
a simple-minded simplification function cannot simplify
$((a^*a^* + a^*) + a^*)\cdot(a^*a^*)^* + (a^*a^* + a^*)\cdot(a^*a^*)^*$
any further.
we would expect a better simplification function to work in the 
following way:
\begin{gather*}
	((a^*a^* + \underbrace{a^*}_\text{A})+\underbrace{a^*}_\text{duplicate of A})\cdot(a^*a^*)^* + 
	\underbrace{(a^*a^* + a^*)\cdot(a^*a^*)^*}_\text{further simp removes this}.\\
	\bigg\downarrow \\
	(a^*a^* + a^* 
	\color{gray} + a^* \color{black})\cdot(a^*a^*)^* + 
	\underbrace{(a^*a^* + a^*)\cdot(a^*a^*)^*}_\text{further simp removes this} \\
	\bigg\downarrow \\
	(a^*a^* + a^* 
	)\cdot(a^*a^*)^*  
	\color{gray} + (a^*a^* + a^*) \cdot(a^*a^*)^*
\end{gather*}
\noindent
This motivating example came from testing Sulzmann and Lu's 
algorithm: their simplification does 
not work!
We quote their $\textit{simp}$ function verbatim here:
\begin{center}
	\begin{tabular}{lcl}
		$\simp\; _{bs}(_{bs'}\ONE \cdot r)$ & $\dn$ & 
		$\textit{if} \; (\textit{zeroable} \; r)\; \textit{then} \;\; \ZERO$\\
						   & &$\textit{else}\;\; \fuse \; (bs@ bs') \; r$\\
		$\simp\;(_{bs}r_1\cdot r_2)$ & $\dn$ & $\textit{if} 
		\; (\textit{zeroable} \; r_1 \; \textit{or} \; \textit{zeroable}\; r_2)\;
		\textit{then} \;\; \ZERO$\\
					     & & $\textit{else}\;\;_{bs}((\simp\;r_1)\cdot
					     (\simp\; r_2))$\\
		$\simp \; _{bs}\sum []$ & $\dn$ & $\ZERO$\\
		$\simp \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2)$ & $\dn$ &
		$_{bs}\sum ((\map \; (\fuse \; bs')\; rs_1) @ rs_2)$\\
		$\simp \; _{bs}\sum[r]$ & $\dn$ & $\fuse \; bs \; (\simp \; r)$\\
		$\simp \; _{bs}\sum(r::rs)$ & $\dn$ & $_{bs}\sum 
		(\nub \; (\filter \; (\not \circ \zeroable)\;((\simp \; r) :: \map \; \simp \; rs)))$\\ 
		
	\end{tabular}
\end{center}
\noindent
the $\textit{zeroable}$ predicate 
which tests whether the regular expression
is equivalent to $\ZERO$,
is defined as:
\begin{center}
	\begin{tabular}{lcl}
		$\zeroable \; _{bs}\sum (r::rs)$ & $\dn$ & $\zeroable \; r\;\; \land \;\;
		\zeroable \;_{[]}\sum\;rs $\\
		$\zeroable\;_{bs}(r_1 \cdot r_2)$ & $\dn$ & $\zeroable\; r_1 \;\; \lor \;\; \zeroable \; r_2$\\
		$\zeroable\;_{bs}r^*$ & $\dn$ & $\textit{false}$ \\
		$\zeroable\;_{bs}c$ & $\dn$ & $\textit{false}$\\
		$\zeroable\;_{bs}\ONE$ & $\dn$ & $\textit{false}$\\
		$\zeroable\;_{bs}\ZERO$ & $\dn$ & $\textit{true}$
	\end{tabular}
\end{center}
%\[
%\begin{aligned}
%&\textit{isPhi} \left(b s @ r i^{*}\right)=\text { False } \\
%&\textit{isPhi}\left(b s @ r i_{1} r i_{2}\right)=\textit{isPhi}  r i_{1} \vee \textit{isPhi}  r i_{2} \\
%&\textit{isPhi} \left(b s @ r i_{1} \oplus r i_{2}\right)=\textit{isPhi}  r i_{1} \wedge \textit{isPhi}  r i_{2} \\
%&\textit{isPhi}(b s @ l)= False \\
%&\textit{isPhi}(b s @ \epsilon)= False \\
%&\textit{isPhi} \; \ZERO = True
%\end{aligned}
%\]
\noindent
Despite that we have already implemented the simple-minded simplifications 
such as throwing away useless $\ONE$s and $\ZERO$s.
The simplification rule $r + r \rightarrow $ cannot make a difference either
since it only removes duplicates on the same level, not something like 
$(r+a)+r \rightarrow r+a$.
This requires us to break up nested alternatives into lists, for example 
using the flatten operation similar to the one defined for any function language's
list datatype:

 \begin{center}
  \begin{tabular}{@{}lcl@{}}
  $\textit{flts} \; (_{bs}\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $(\textit{map} \;
     (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flts} \; as' $ \\
  $\textit{flts} \; \ZERO :: as'$ & $\dn$ & $ \textit{flts} \;  \textit{as'} $ \\
    $\textit{flts} \; a :: as'$ & $\dn$ & $a :: \textit{flts} \; \textit{as'}$ \quad(otherwise) 
\end{tabular}    
\end{center}  

\noindent
There is a minor difference though, in that our $\flts$ operation defined by us
also throws away $\ZERO$s.
For a flattened list of regular expressions, a de-duplication can be done efficiently:


\begin{center}
	\begin{tabular}{@{}lcl@{}}
		$\distinctBy \; [] \; f\; acc $ & $ =$ & $ []$\\
		$\distinctBy \; (x :: xs) \; f \; acc$ & $=$ & \\
						       & & $\quad \textit{if} (f \; x) \in acc \textit{then} \distinctBy \; xs \; f \; acc$\\
						       & & $\quad \textit{else} x :: (\distinctBy \; xs \; f \; (\{f \; x\} \cup acc))$ 
	\end{tabular}
\end{center}
\noindent
The reason we define a distinct function under a mapping $f$ is because
we want to eliminate regular expressions that are the same syntactically,
but having different bit-codes (more on the reason why we can do this later).
To achieve this, we call $\erase$ as the function $f$ during the distinction
operation.
A recursive definition of our  simplification function 
that looks somewhat similar to our Scala code is given below:

\begin{center}
  \begin{tabular}{@{}lcl@{}}
   
	  $\textit{bsimp} \; (_{bs}a_1\cdot a_2)$ & $\dn$ & $ \textit{bsimp}_{ASEQ} \; bs \;(\textit{bsimp} \; a_1) \; (\textit{bsimp}  \; a_2)  $ \\
	  $\textit{bsimp} \; (_{bs}\sum \textit{as})$ & $\dn$ & $\textit{bsimp}_{ALTS} \; \textit{bs} \; (\textit{distinctBy} \; ( \textit{flatten} ( \textit{map} \; bsimp \; as)) \; \erase \; \phi) $ \\
   $\textit{bsimp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$   
\end{tabular}    
\end{center}    

\noindent
The simplification (named $\bsimp$ for \emph{b}it-coded) does a pattern matching on the regular expression.
When it detected that the regular expression is an alternative or
sequence, it will try to simplify its children regular expressions
recursively and then see if one of the children turns into $\ZERO$ or
$\ONE$, which might trigger further simplification at the current level.
Current level simplifications are handled by the function $\textit{bsimp}_{ASEQ}$,
using rules such as  $\ZERO \cdot r \rightarrow \ZERO$ and $\ONE \cdot r \rightarrow r$.
\begin{center}
	\begin{tabular}{@{}lcl@{}}
		$\textit{bsimp}_{ASEQ} \; bs\; a \; b$ & $\dn$ & $ (a,\; b) \textit{match}$\\
   &&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow  \ZERO$ \\
   &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow  \ZERO$ \\
   &&$\quad\textit{case} \;  (_{bs1}\ONE, a_2') \Rightarrow  \textit{fuse} \; (bs@bs_1) \;  a_2'$ \\
   &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow   _{bs}a_1' \cdot a_2'$ 
	\end{tabular}
\end{center}
\noindent
The most involved part is the $\sum$ clause, where we first call $\flts$ on
the simplified children regular expression list $\textit{map}\; \textit{bsimp}\; \textit{as}$.
and then call $\distinctBy$ on that list, the predicate determining whether two 
elements are the same is $\erase \; r_1 = \erase\; r_2$.
Finally, depending on whether the regular expression list $as'$ has turned into a
singleton or empty list after $\flts$ and $\distinctBy$, $\textit{bsimp}_{AALTS}$
decides whether to keep the current level constructor $\sum$ as it is, and 
removes it when there are less than two elements:
\begin{center}
	\begin{tabular}{lcl}
		$\textit{bsimp}_{AALTS} \; bs \; as'$ & $ \dn$ & $ as' \; \textit{match}$\\		
  &&$\quad\textit{case} \; [] \Rightarrow  \ZERO$ \\
   &&$\quad\textit{case} \; a :: [] \Rightarrow  \textit{fuse bs a}$ \\
   &&$\quad\textit{case} \;  as' \Rightarrow _{bs}\sum \textit{as'}$\\ 
	\end{tabular}
	
\end{center}
Having defined the $\bsimp$ function,
we add it as a phase after a derivative is taken,
so it stays small:
\begin{center}
	\begin{tabular}{lcl}
		$r \backslash_{bsimp} s$ & $\dn$ & $\textit{bsimp}(r \backslash s)$
	\end{tabular}
\end{center}
Following previous notation of  natural
extension from derivative w.r.t.~character to derivative
w.r.t.~string, we define the derivative that nests simplifications with derivatives:%\comment{simp in  the [] case?}

\begin{center}
\begin{tabular}{lcl}
$r \backslash_{bsimps} (c\!::\!s) $ & $\dn$ & $(r \backslash_{bsimp}\, c) \backslash_{bsimps}\, s$ \\
$r \backslash_{bsimps} [\,] $ & $\dn$ & $r$
\end{tabular}
\end{center}

\noindent
Extracting bit-codes from the derivatives that had been simplified repeatedly after 
each derivative run, the simplified $\blexer$, called $\blexersimp$,
is defined as 
 \begin{center}
\begin{tabular}{lcl}
  $\textit{blexer\_simp}\;r\,s$ & $\dn$ &
      $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\                
  & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
  & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
  & & $\;\;\textit{else}\;\textit{None}$
\end{tabular}
\end{center}

\noindent
This algorithm keeps the regular expression size small, for example,
with this simplification our previous $(a + aa)^*$ example's fast growth (over
$10^5$ nodes at around $20$ input length)
will be reduced to just 17 and stays constant, no matter how long the
input string is.
We show some graphs to better demonstrate this imrpovement.


\section{$(a+aa)^*$ and $(a^*\cdot a^*)^*$  against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$ After Simplification}
For $(a+aa)^*$, it used to grow to over $9000$ nodes with simple-minded simplification
at only around $15$ input characters:
\begin{center}
\begin{tabular}{ll}
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    ylabel={derivative size},
        width=7cm,
    height=4cm, 
    legend entries={Lexer with $\textit{bsimp}$},  
    legend pos=  south east,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {a_aa_star_bsimp.data};
\end{axis}
\end{tikzpicture} %\label{fig:BitcodedLexer}
&
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    ylabel={derivative size},
    width = 7cm,
    height = 4cm,
    legend entries={Lexer without $\textit{bsimp}$},  
    legend pos=  north west,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {a_aa_star_easysimp.data};
\end{axis}
\end{tikzpicture} 
\end{tabular}
\end{center}
And for $(a^*\cdot a^*)^*$, unlike in \ref{fig:BetterWaterloo}, the size with simplification now stay nicely constant with current
simplification rules (we put the graphs together to show the contrast)
\begin{center}
\begin{tabular}{ll}
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    ylabel={derivative size},
        width=7cm,
    height=4cm, 
    legend entries={Lexer with $\textit{bsimp}$},  
    legend pos=  south east,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {BitcodedLexer.data};
\end{axis}
\end{tikzpicture} %\label{fig:BitcodedLexer}
&
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    ylabel={derivative size},
    width = 7cm,
    height = 4cm,
    legend entries={Lexer without $\textit{bsimp}$},  
    legend pos=  north west,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {BetterWaterloo.data};
\end{axis}
\end{tikzpicture} 
\end{tabular}
\end{center}

%----------------------------------------------------------------------------------------
%	SECTION rewrite relation
%----------------------------------------------------------------------------------------
\section{The Rewriting Relation $\rrewrite$($\rightsquigarrow$)}
In the $\blexer$'s correctness proof, we
did not directly derive the fact that $\blexer$ gives out the POSIX value,
but first prove that $\blexer$ is linked with $\lexer$.
Then we re-use
the correctness of $\lexer$.
Here we follow suit by first proving that
$\blexersimp \; r \; s $ 
produces the same output as $\blexer \; r\; s$,
and then putting it together with 
$\blexer$'s correctness result
($(r, s) \rightarrow v \implies \blexer \; r \;s = v$)
to obtain half of the correctness property (the other half
being when $s$ is not $L \; r$ which is routine to establish)
\begin{center}
	$(r, s) \rightarrow v \implies \blexersimp \; r \; s = v$
\end{center}
\noindent
because it makes the proof simpler
The overall idea for the correctness of our simplified bitcoded lexer
\noindent
is that the $\textit{bsimp}$ will not change the regular expressions so much that
it becomes impossible to extract the $\POSIX$ values.
To capture this "similarity" between unsimplified regular expressions and simplified ones,
we devise the rewriting relation $\rrewrite$, written infix as $\rightsquigarrow$:

\begin{center}
\begin{mathpar}
	\inferrule{}{_{bs} \ZERO \cdot r_2 \rightsquigarrow \ZERO\\}

	\inferrule{}{_{bs} r_1 \cdot \ZERO \rightsquigarrow \ZERO\\}

	\inferrule{}{_{bs1} ((_{bs2} \ONE) \cdot r) \rightsquigarrow \fuse \; (bs_1 @ bs_2) \; r\\}\\

	
	
	\inferrule{\\ r_1 \rightsquigarrow r_2}{_{bs} r_1 \cdot r_3 \rightsquigarrow _{bs} r_2 \cdot r_3\\}

	\inferrule{\\ r_3 \rightsquigarrow r_4}{_{bs} r_1 \cdot r_3 \rightsquigarrow _{bs} r_1 \cdot r_4\\}\\

	\inferrule{}{ _{bs}\sum [] \rightsquigarrow \ZERO\\}

	\inferrule{}{ _{bs}\sum [a] \rightsquigarrow \fuse \; bs \; a\\}

	\inferrule{\\ rs_1 \stackrel{s}{\rightsquigarrow} rs_2}{_{bs}\sum rs_1 \rightsquigarrow rs_2\\}

	\inferrule{}{\\ [] \stackrel{s}{\rightsquigarrow} []}

	\inferrule{rs_1 \stackrel{s}{\rightsquigarrow} rs_2}{\\ r :: rs_1 \rightsquigarrow r :: rs_2 \rightsquigarrow}

	\inferrule{r_1 \rightsquigarrow r_2}{ r_1 :: rs \stackrel{s}{\rightsquigarrow} r_2 :: rs\\}

	\inferrule{}{\ZERO :: rs \stackrel{s}{\rightsquigarrow} rs}

	\inferrule{}{_{bs} \sum (rs_1 :: rs_b) \stackrel{s}{\rightsquigarrow} ((\map \; (\fuse \; bs_1) \; rs_1) @ rsb) }

	\inferrule{\\ \rerase{a_1} = \rerase{a_2}}{rs_a @ [a_1] @ rs_b @ [a_2] @ rsc \stackrel{s}{\rightsquigarrow} rs_a @ [a_1] @ rs_b @ rs_c}

\end{mathpar}
\end{center}
These "rewrite" steps define the atomic simplifications we could impose on regular expressions
under our simplification algorithm.
For convenience, we define a relation $\stackrel{s}{\rightsquigarrow}$ for rewriting
a list of regular exression to another list.
The $\rerase{}$ function is used instead of $_\downarrow$ for the finiteness bound proof of next chapter,
which we will discuss later. For the moment the reader can assume they basically do the same thing as $\erase$.

Usually more than one steps are taking place during the simplification of a regular expression,
therefore we define the reflexive transitive closure of the $\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$
relations:

\begin{center}
\begin{mathpar}
	\inferrule{}{ r \stackrel{*}{\rightsquigarrow} r \\}
	\inferrule{}{\\ rs \stackrel{s*}{\rightsquigarrow} rs \\}
	\inferrule{\\r_1 \stackrel{*}{\rightsquigarrow}  r_2 \land \; r_2 \stackrel{*}{\rightsquigarrow} r_3}{r_1 \stackrel{*}{\rightsquigarrow} r_3\\}
	\inferrule{\\rs_1 \stackrel{*}{\rightsquigarrow}  rs_2 \land \; rs_2 \stackrel{*}{\rightsquigarrow} rs_3}{rs_1 \stackrel{*}{\rightsquigarrow} rs_3}
\end{mathpar}
\end{center}
Now that we have modelled the rewriting behaviour of our simplifier $\bsimp$, we prove mainly 
three properties about how these relations connect to $\blexersimp$:
\begin{itemize}
	\item
		$r \stackrel{*}{\rightsquigarrow} \bsimp{r}$
		The algorithm $\bsimp$ only transforms the regex $r$ using steps specified by 
		$\stackrel{*}{\rightsquigarrow}$ and nothing else.
	\item
		$r \rightsquigarrow r' \implies r \backslash c \rightsquigarrow r'\backslash c$.
		The relation $\stackrel{*}{rightsquigarrow}$ is preserved under derivative. 
	\item
		$(r \stackrel{*}{\rightsquigarrow} r'\land \bnullable \; r_1) \implies \bmkeps \; r = \bmkeps \; r'$. If we reach another 
		expression in finitely many atomic simplification steps, then these two regular expressions
		will produce the same bit-codes under the bit collection function $\bmkeps$ used by our $\blexer$.

\end{itemize}
\section{Three Important properties}
These properties would work together towards the correctness theorem.
We start proving each of these lemmas below.
\subsection{$(r \stackrel{*}{\rightsquigarrow} r'\land \bnullable \; r_1) \implies \bmkeps \; r = \bmkeps \; r'$ and $r \stackrel{*}{\rightsquigarrow} \bsimp{r}$}
The first few properties we establish is that the inference rules we gave for $\rightsquigarrow$
and $\stackrel{s}{\rightsquigarrow}$ also hold as implications for $\stackrel{*}{\rightsquigarrow}$ and 
$\stackrel{s*}{\rightsquigarrow}$.
\begin{lemma}
	$rs_1 \stackrel{s*}{\rightsquigarrow} rs_2 \implies _{bs} \sum rs_1 \stackrel{*}{\rightsquigarrow} _{bs} \sum rs_2$
\end{lemma}
\begin{proof}
	By rule induction of $\stackrel{s*}{\rightsquigarrow}$.
\end{proof}
\begin{lemma}
	$r \stackrel{*}{\rightsquigarrow} r' \implies _{bs} \sum r :: rs \stackrel{*}{\rightsquigarrow}  _{bs} \sum r' :: rs$
\end{lemma}
\begin{proof}
	By rule induction of $\stackrel{*}{\rightsquigarrow} $.
\end{proof}
\noindent
Then we establish that the $\stackrel{s}{\rightsquigarrow}$ and $\stackrel{s*}{\rightsquigarrow}$ relation is preserved w.r.t appending and prepending of a  list:
\begin{lemma}
	$rs_1 \stackrel{s}{\rightsquigarrow} rs_2 \implies rs @ rs_1 \stackrel{s}{\rightsquigarrow} rs @ rs_2$
\end{lemma}
\begin{proof}
	By induction on the list $rs$.
\end{proof}

\begin{lemma}
	$rs_1 \stackrel{s*}{\rightsquigarrow} rs_2 \implies rs @ rs_1 \stackrel{s*}{\rightsquigarrow} rs @ rs_2$
\end{lemma}
\begin{proof}
	By rule induction of $\stackrel{s*}{\rightsquigarrow}$.
\end{proof}

\noindent
The $\stackrel{s}{\rightsquigarrow} $ relation after appending a list becomes $\stackrel{s*}{\rightsquigarrow}$:
\begin{lemma}\label{ssgqTossgs}
	$rs_1 \stackrel{s}{\rightsquigarrow} rs_2 \implies rs_1 @ rs \stackrel{s*}{\rightsquigarrow} rs_2 @ rs$
\end{lemma}
\begin{proof}
	By rule induction of $\stackrel{s}{\rightsquigarrow}$.
\end{proof}
\begin{lemma}
	$rs_1 \stackrel{s*}{\rightsquigarrow} rs_2 \implies rs_1 @ rs \stackrel{s*}{\rightsquigarrow} rs_2 @ rs$
\end{lemma}
\begin{proof}
	By rule induction of $\stackrel{s*}{\rightsquigarrow}$ and using \ref{ssgqTossgs}.
\end{proof}
Here are two lemmas relating $\stackrel{*}{\rightsquigarrow}$ and $\stackrel{s*}{\rightsquigarrow}$:
\begin{lemma}\label{singleton}
	$r_1 \stackrel{*}{\rightsquigarrow} r_2 \implies [r_1] \stackrel{s*}{\rightsquigarrow} [r_2]$
\end{lemma}
\begin{proof}
	By rule induction of $ \stackrel{*}{\rightsquigarrow} $.
\end{proof}
\begin{lemma}
	$rs_3 \stackrel{s*}{\rightsquigarrow} rs_4 \land r_1 \stackrel{*}{\rightsquigarrow}  r_2 \implies
	r_2 :: rs_3 \stackrel{s*}{\rightsquigarrow} r_2 :: rs_4$
\end{lemma}
\begin{proof}
	By using \ref{singleton}.
\end{proof}
Now we get to the "meaty" part of the proof, which relates the relations $\stackrel{s*}{\rightsquigarrow}$ and 
$\stackrel{*}{\rightsquigarrow} $ with our simplification components such $\distinctBy$ and $\flts$.
The first lemma below says that for a list made of two parts $rs_1 @ rs_2$, one can throw away the duplicate
elements in $rs_2$, as well as those that have appeared in $rs_1$:
\begin{lemma}
	$rs_1 @ rs_2 \stackrel{s*}{\rightsquigarrow} (rs_1 @ (\distinctBy \; rs_2 \; \; \rerase{\_}\;  \; (\map\;\; \rerase{\_}\; \; rs_1)))$
\end{lemma}
\begin{proof}
	By induction on $rs_2$, where $rs_1$ is allowed to be arbitrary.
\end{proof}
The above h as the corollary that is suitable for the actual way $\distinctBy$ is called in $\bsimp$:
\begin{lemma}\label{dBPreserves}
	$rs_ 1 \rightarrow \distinctBy \; rs_1 \; \phi$
\end{lemma}



The flatten function $\flts$ works within the $\rightsquigarrow$ relation:
\begin{lemma}\label{fltsPreserves}
	$rs \stackrel{s*}{\rightsquigarrow} \flts \; rs$
\end{lemma}

The rewriting in many steps property is composible in terms of regular expression constructors:
\begin{lemma}
	$r_1 \stackrel{*}{\rightsquigarrow} r_2 \implies _{bs} r_1 \cdot r_3 \stackrel{*}{\rightsquigarrow} \;  _{bs} r_2 \cdot r_3 \quad $ and 
$r_3 \stackrel{*}{\rightsquigarrow} r_4 \implies _{bs} r_1 \cdot r_3 \stackrel{*}{\rightsquigarrow} _{bs} \; r_1 \cdot r_4$
\end{lemma}

The rewriting in many steps properties $\stackrel{*}{\rightsquigarrow}$ and $\stackrel{s*}{\rightsquigarrow}$ is preserved under the function $\fuse$:
\begin{lemma}
	$r_1 \stackrel{*}{\rightsquigarrow} r_2 \implies \fuse \; bs \; r_1 \stackrel{*}{\rightsquigarrow} \; \fuse \; bs \; r_2 \quad $ and 
	$rs_1 \stackrel{s}{\rightsquigarrow} rs_2 \implies \map \; (\fuse \; bs) \; rs_1 \stackrel{s*}{\rightsquigarrow} \map \; (\fuse \; bs) \; rs_2$
\end{lemma}
\begin{proof}
	By the properties $r_1 \rightsquigarrow r_2 \implies \fuse \; bs \; r_1 \implies \fuse \; bs \; r_2$ and
	$rs_2 \stackrel{s}{\rightsquigarrow} rs_3 \implies \map \; (\fuse \; bs) rs_2 \stackrel{s*}{\rightsquigarrow} \map \; (\fuse \; bs)\; rs_3$.
\end{proof}
\noindent
If we could rewrite a regular expression in many steps to $\ZERO$, then 
we could also rewrite any sequence containing it to $\ZERO$:
\begin{lemma}
	$r_1 \stackrel{*}{\rightsquigarrow} \ZERO \implies _{bs}r_1\cdot r_2 \stackrel{*}{\rightsquigarrow} \ZERO$
\end{lemma}
\begin{lemma}
	$\bmkeps \; (r \backslash s) = \bmkeps \; \bderssimp{r}{s}$
\end{lemma}
\noindent
The function $\bsimpalts$ preserves rewritability:
\begin{lemma}\label{bsimpaltsPreserves}
	$_{bs} \sum rs \stackrel{*}{\rightsquigarrow} \bsimpalts \; _{bs} \; rs$
\end{lemma}
\noindent
Before we give out the next lemmas, we define a predicate for a list of regular expressions
having at least one nullable regular expressions:
\begin{definition}
	$\textit{bnullables} \; rs \dn  \exists r \in rs. \bnullable r$
\end{definition}
The rewriting relation $\rightsquigarrow$ preserves nullability:
\begin{lemma}
	$r_1 \rightsquigarrow r_2 \implies  \bnullable \; r_1 = \bnullable \; r_2$ and
	$rs_1 \stackrel{s}{\rightsquigarrow} rs_2 \implies \textit{bnullables} \; rs_1 = \textit{bnullables} \; rs_2$
\end{lemma}
\begin{proof}
	By rule induction of $\rightarrow$ and $\stackrel{s}{\rightsquigarrow}$.
\end{proof}
So does the many steps rewriting:
\begin{lemma}\label{rewritesBnullable}
	$r_1 \stackrel{*}{\rightsquigarrow}  r_2 \implies \bnullable \; r_1 = \bnullable \; r_2$
\end{lemma}
\begin{proof}
	By rule induction of $\stackrel{*}{\rightsquigarrow} $.
\end{proof}
\noindent
And if both regular expressions in a rewriting relation are nullable, then they 
produce the same bit-codes:

\begin{lemma}\label{rewriteBmkepsAux}
	$r_1 \rightsquigarrow r_2 \implies (\bnullable \; r_1 \land \bnullable \; r_2 \implies \bmkeps \; r_1 = 
	\bmkeps \; r_2)$ and
	$rs_ 1 \stackrel{s}{\rightsquigarrow} rs_2 \implies (\bnullables \; rs_1 \land \bnullables \; rs_2 \implies 
	\bmkepss \; rs_1 = \bmkepss \; rs2)$
\end{lemma}
\noindent
The definition of $\bmkepss$ on list $rs$ is just to extract the bit-codes on the first element in $rs$ that 
is $bnullable$:
\begin{center}
	\begin{tabular}{lcl}
		$\bmkepss \; [] $ & $\dn$ & $[]$\\
		$\bmkepss \; r :: rs$ & $\dn$ & $\textit{if} \; \bnullable \; r then \; (\bmkeps \; r) \; \textit{else} \; \bmkepss \; rs$
	\end{tabular}
\end{center}
\noindent
And now we are ready to prove the key property that if you 
have two regular expressions, one rewritable in many steps to the other,
and one of them is $\bnullable$, then they will both yield the same bits under $\bmkeps$:
\begin{lemma}
	$\text{If} \;\; r \stackrel{*}{\rightsquigarrow} r' \;\; \text{and} \;\; \bnullable \; r_1 \;\; \text{then} \;\; \bmkeps \; r = \bmkeps \; r'$
\end{lemma}
\begin{proof}
	By rule induction of $\stackrel{*}{\rightsquigarrow} $, using \ref{rewriteBmkepsAux} and $\ref{rewritesBnullable}$.
\end{proof}
\noindent
the other property is also ready:
\begin{lemma}
	$r \stackrel{*}{\rightsquigarrow} \bsimp{r}$
\end{lemma}
\begin{proof}
	By an induction on $r$.
The most difficult case would be the alternative case, where we using properties such as \ref{bsimpaltsPreserves} and \ref{fltsPreserves} and \ref{dBPreserves}, we could continuously rewrite a list like:\\
	$rs \stackrel{s*}{\rightsquigarrow} \map \; \textit{bsimp} \; rs$\\
	$\ldots \stackrel{s*}{\rightsquigarrow} \flts \; (\map \; \textit{bsimp} \; rs)$\\
	$\ldots \;\stackrel{s*}{\rightsquigarrow} \distinctBy \; (\flts \; (\map \; \textit{bsimp}\; rs)) \; \rerase \; \phi$\\
	Then we could do the following regular expresssion many steps rewrite:\\
	$ _{bs} \sum \distinctBy \; (\flts \; (\map \; \textit{bsimp}\; rs)) \; \rerase \; \phi \stackrel{*}{\rightsquigarrow} \bsimpalts \; bs \; (\distinctBy \; (\flts \; (\map \; \textit{bsimp}\; rs)) \; \rerase \; \phi)$
	\\
	
\end{proof}
\section{Proof for the Property: $r_1 \stackrel{*}{\rightsquigarrow}  r_2 \implies r_1 \backslash c \stackrel{*}{\rightsquigarrow} r_2 \backslash c$}
The first thing we prove is that if we could rewrite in one step, then after derivative
we could rewrite in many steps:
\begin{lemma}\label{rewriteBder}
	$r_1 \rightsquigarrow r_2 \implies r_1 \backslash c \stackrel{*}{\rightsquigarrow}  r_2 \backslash c$ and
	$rs_1 \stackrel{s}{\rightsquigarrow} rs_2 \implies \map \; (\_\backslash c) \; rs_1 \stackrel{s*}{\rightsquigarrow} \map \; (\_ \backslash c) \; rs_2$
\end{lemma}
\begin{proof}
	By induction on $\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$, using a number of the previous lemmas.
\end{proof}
Now we can prove that once we could rewrite from one expression to another in many steps,
then after a derivative on both sides we could still rewrite one to another in many steps:
\begin{lemma}
	$r_1 \stackrel{*}{\rightsquigarrow}  r_2 \implies r_1 \backslash c \stackrel{*}{\rightsquigarrow}  r_2 \backslash c$
\end{lemma}
\begin{proof}
	By rule induction of $\stackrel{*}{\rightsquigarrow} $ and using the previous lemma :\ref{rewriteBder}.
\end{proof}
\noindent
This can be extended and combined with the previous two important properties
so that a regular expression's successivve derivatives can be rewritten in many steps
to its simplified counterpart:
\begin{lemma}\label{bderBderssimp}
	$a \backslash s \stackrel{*}{\rightsquigarrow} \bderssimp{a}{s} $
\end{lemma}
\subsection{Main Theorem}
Now with \ref{bdersBderssimp} we are ready for the main theorem.
To link $\blexersimp$ and $\blexer$, 
we first say that they give out the same bits, if the lexing result is a match:
\begin{lemma}
	$\bnullable \; (a \backslash s) \implies \bmkeps \; (a \backslash s) = \bmkeps \; (\bderssimp{a}{s})$
\end{lemma}
\noindent
Now that they give out the same bits, we know that they give the same value after decoding,
which we know is correct value as $\blexer$ is correct:
\begin{theorem}
	$\blexer \; r \; s = \blexersimp{r}{s}$
\end{theorem}
\noindent
\begin{proof}
	One can rewrite in many steps from the original lexer's derivative regular expressions to the 
	lexer with simplification applied:
	$a \backslash s \stackrel{*}{\rightsquigarrow} \bderssimp{a}{s} $.
	If two regular expressions are rewritable, then they produce the same bits.
	Under the condition that $ r_1$ is nullable, we have
	$\text{If} \;\; r \stackrel{*}{\rightsquigarrow} r', \;\;\text{then} \;\; \bmkeps \; r = \bmkeps \; r'$.
	This proves the \emph{if} (interesting) branch of
	$\blexer \; r \; s = \blexersimp{r}{s}$.

\end{proof}
\noindent
As a corollary,
we link this result with the lemma we proved earlier that 
\begin{center}
	$(r, s) \rightarrow v \implies \blexer \; r \; s = v$
\end{center}
and obtain the corollary that the bit-coded lexer with simplification is
indeed correctly outputting POSIX lexing result, if such a result exists.
\begin{corollary}
	$(r, s) \rightarrow v \implies \blexersimp{r}{s}$
\end{corollary}

\subsection{Comments on the Proof Techniques Used}
The non-trivial part of proving the correctness of the algorithm with simplification
compared with not having simplification is that we can no longer use the argument 
in \cref{flex_retrieve}.
The function \retrieve needs the cumbersome structure of the (umsimplified)
annotated regular expression to 
agree with the structure of the value, but simplification will always mess with the 
structure.

We also tried to prove $\bsimp{\bderssimp{a}{s}} = \bsimp{a\backslash s}$,
but this turns out to be not true, A counterexample of this being
\[ r = [(1+c)\cdot [aa \cdot (1+c)]] \land s = aa
\]

Then we would have $\bsimp{a \backslash s}$ being 
$_{[]}(_{ZZ}\ONE +  _{ZS}c ) $
whereas $\bsimp{\bderssimp{a}{s}}$ would be 
$_{Z}(_{Z} \ONE + _{S} c)$.
Unfortunately if we apply $\textit{bsimp}$ at different
stages we will always have this discrepancy, due to 
whether the $\map \; (\fuse\; bs) \; as$ operation in $\textit{bsimp}$
is taken at some points will be entirely dependant on when the simplification 
take place whether there is a larger alternative structure surrounding the 
alternative being simplified.
The good thing about $\stackrel{*}{\rightsquigarrow} $ is that it allows
us not specify how exactly the "atomic" simplification steps $\rightsquigarrow$
are taken, but simply say that they can be taken to make two similar 
regular expressions equal, and can be done after interleaving derivatives
and simplifications.


Having correctness property is good. But we would also like the lexer to be efficient in 
some sense, for exampe, not grinding to a halt at certain cases.
In the next chapter we shall prove that for a given $r$, the internal derivative size is always
finitely bounded by a constant.
author	Chengsong
	Wed, 17 Aug 2022 01:09:13 +0100
changeset 579	35df9cdd36ca
parent 576	3e1b699696b6
child 582	3e19073e91f4
permissions	-rwxr-xr-x