lexing: ChengsongTanPhdThesis/Chapters/Bitcoded2.tex@ef2b8abcbc55


% Chapter Template

% Main chapter title
\chapter{Correctness of Bit-coded Algorithm with Simplification}

\label{Bitcoded2} % Change X to a consecutive number; for referencing this chapter elsewhere, use \ref{ChapterX}
%Then we illustrate how the algorithm without bitcodes falls short for such aggressive 
%simplifications and therefore introduce our version of the bitcoded algorithm and 
%its correctness proof in 
%Chapter 3\ref{Chapter3}. 
\section{Overview}

This chapter
is the point from which novel contributions of this PhD project are introduced
in detail, 
and previous
chapters are essential background work for setting the scene of the formal proof we
are about to describe.
In particular, the correctness theorem 
of the un-optimised bit-coded lexer $\blexer$ in 
chapter \ref{Bitcoded1} formalised by Ausaf et al.
relies on lemma \ref{retrieveStepwise} that says
any value can be retrieved in a stepwise manner:
\begin{center}	
	$\vdash v : (r\backslash c) \implies \retrieve \; (r \backslash c)  \;  v= \retrieve \; r \; (\inj \; r\; c\; v)$
\end{center}
This no longer holds once we introduce simplifications.
To control the size of regular expressions during derivatives, 
one has to eliminate redundant sub-expression with some
procedure we call $\textit{simp}$, 
and $\textit{simp}$ is defined as
:



Having defined the $\textit{bsimp}$ function,
we add it as a phase after a derivative is taken.
\begin{center}
	\begin{tabular}{lcl}
		$r \backslash_{bsimp} c$ & $\dn$ & $\textit{bsimp}(r \backslash c)$
	\end{tabular}
\end{center}
%Following previous notations
%when extending from derivatives w.r.t.~character to derivative
%w.r.t.~string, we define the derivative that nests simplifications 
%with derivatives:%\comment{simp in  the [] case?}
We extend this from characters to strings:
\begin{center}
\begin{tabular}{lcl}
$r \backslash_{bsimps} (c\!::\!s) $ & $\dn$ & $(r \backslash_{bsimp}\, c) \backslash_{bsimps}\, s$ \\
$r \backslash_{bsimps} [\,] $ & $\dn$ & $r$
\end{tabular}
\end{center}

\noindent
The lexer that extracts bitcodes from the 
derivatives with simplifications from our $\simp$ function
is called $\blexersimp$:

\begin{center}

\begin{center}
	\begin{tabular}{lcl}
		$r \backslash_{bsimp} s$ & $\dn$ & $\textit{bsimp}(r \backslash s)$
	\end{tabular}
\end{center}
\begin{tabular}{lcl}
  $\textit{blexer\_simp}\;r\,s$ & $\dn$ &
      $\textit{let}\;a = (r^\uparrow)\backslash_{bsimp}\, s\;\textit{in}$\\                
  & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
  & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
  & & $\;\;\textit{else}\;\textit{None}$
\end{tabular}
\end{center}
\noindent
the redundant sub-expressions after each derivative operation
allows the exact structure of each intermediate result to be preserved,
so that pairs of inhabitation relations in the form $\vdash v : r_{c} $ and
$\vdash v^{c} : r $ hold (if we allow the abbreviation $r_{c} \dn r\backslash c$
and $v^{c} \dn \inj \;r \; c \; v$).


Define the 
But $\blexersimp$ introduces simplification after the derivative
to reduce redundancy,
yielding $r \backslash c$ 
This allows 

The proof details are necessary materials for this thesis
because it provides necessary context to explain why we need a
new framework for the proof of $\blexersimp$, which involves
simplifications that cause structural changes to the regular expression.
a new formal proof of the correctness of $\blexersimp$, where the 
proof of $\blexer$
is not applicatble in the sense that we cannot straightforwardly extend the
proof of theorem \ref{blexerCorrect} because lemma \ref{flex_retrieve} does
not hold anymore.
This is because the structural induction on the stepwise correctness
of $\inj$ breaks due to each pair of $r_i$ and $v_i$ described
in chapter \ref{Inj} and \ref{Bitcoded1} no longer correspond to
each other.
In this chapter we introduce simplifications
for annotated regular expressions that can be applied to 
each intermediate derivative result. This allows
us to make $\blexer$ much more efficient.
Sulzmann and Lu already introduced some simplifications for bitcoded regular expressions,
but their simplification functions could have been more efficient and in some cases needed fixing.




From this chapter we start with the main contribution of this thesis, which

o
In particular, the $\blexer$ proof relies on a lockstep POSIX
correspondence between the lexical value and the
regular expression in each derivative and injection.


which is essential for getting an understanding this thesis
in chapter \ref{Bitcoded1}, which is necessary for understanding why
the proof 

In this chapter,

%We contrast our simplification function 
%with Sulzmann and Lu's, indicating the simplicity of our algorithm.
%This is another case for the usefulness 
%and reliability of formal proofs on algorithms.
%These ``aggressive'' simplifications would not be possible in the injection-based 
%lexing we introduced in chapter \ref{Inj}.
%We then prove the correctness with the improved version of 
%$\blexer$, called $\blexersimp$, by establishing 
%$\blexer \; r \; s= \blexersimp \; r \; s$ using a term rewriting system.
%
\section{Simplifications by Sulzmann and Lu}
The algorithms $\lexer$ and $\blexer$ work beautifully as functional 
programs, but not as practical code. One main reason for the slowness is due
to the size of intermediate representations--the derivative regular expressions
tend to grow unbounded if the matching involved a large number of possible matches.
Consider the derivatives of the following example $(a^*a^*)^*$:
%and $(a^* + (aa)^*)^*$:
\begin{center}
	\begin{tabular}{lcl}
		$(a^*a^*)^*$ & $ \stackrel{\backslash a}{\longrightarrow}$ & 
		$ (a^*a^* + a^*)\cdot(a^*a^*)^*$\\
			     & 
		$ \stackrel{\backslash a}{\longrightarrow} $ & 
	$((a^*a^* + a^*) + a^*)\cdot(a^*a^*)^* + (a^*a^* + a^*)\cdot(a^*a^*)^*$\\
							     & $\stackrel{\backslash a}{
	\longrightarrow} $ & $\ldots$\\
	\end{tabular}
\end{center}
\noindent
As can be seen, there are several duplications.
A simple-minded simplification function cannot simplify
the third regular expression in the above chain of derivative
regular expressions, namely
\begin{center}
$((a^*a^* + a^*) + a^*)\cdot(a^*a^*)^* + (a^*a^* + a^*)\cdot(a^*a^*)^*$
\end{center}
because the duplicates are
not next to each other, and therefore the rule
$r+ r \rightarrow r$ from $\textit{simp}$ does not fire.
One would expect a better simplification function to work in the 
following way:
\begin{gather*}
	((a^*a^* + \underbrace{a^*}_\text{A})+\underbrace{a^*}_\text{duplicate of A})\cdot(a^*a^*)^* + 
	\underbrace{(a^*a^* + a^*)\cdot(a^*a^*)^*}_\text{further simp removes this}.\\
	\bigg\downarrow (1) \\
	(a^*a^* + a^* 
	\color{gray} + a^* \color{black})\cdot(a^*a^*)^* + 
	\underbrace{(a^*a^* + a^*)\cdot(a^*a^*)^*}_\text{further simp removes this} \\
	\bigg\downarrow (2) \\
	(a^*a^* + a^* 
	)\cdot(a^*a^*)^*  
	\color{gray} + (a^*a^* + a^*) \cdot(a^*a^*)^*\\
	\bigg\downarrow (3) \\
	(a^*a^* + a^* 
	)\cdot(a^*a^*)^*  
\end{gather*}
\noindent
In the first step, the nested alternative regular expression
$(a^*a^* + a^*) + a^*$ is flattened into $a^*a^* + a^* + a^*$.
Now the third term $a^*$ can clearly be identified as a duplicate
and therefore removed in the second step. 
This causes the two
top-level terms to become the same and the second $(a^*a^*+a^*)\cdot(a^*a^*)^*$ 
removed in the final step.
Sulzmann and Lu's simplification function (using our notations) can achieve this
simplification:
\begin{center}
	\begin{tabular}{lcl}
		$\textit{simp}\_{SL} \; _{bs}(_{bs'}\ONE \cdot r)$ & $\dn$ & 
		$\textit{if} \; (\textit{zeroable} \; r)\; \textit{then} \;\; \ZERO$\\
						   & &$\textit{else}\;\; \fuse \; (bs@ bs') \; r$\\
		$\textit{simp}\_{SL} \;(_{bs}r_1\cdot r_2)$ & $\dn$ & $\textit{if} 
		\; (\textit{zeroable} \; r_1 \; \textit{or} \; \textit{zeroable}\; r_2)\;
		\textit{then} \;\; \ZERO$\\
							    & & $\textit{else}\;\;_{bs}((\textit{simp}\_{SL} \;r_1)\cdot
							    (\textit{simp}\_{SL} \; r_2))$\\
		$\textit{simp}\_{SL}  \; _{bs}\sum []$ & $\dn$ & $\ZERO$\\
		$\textit{simp}\_{SL}  \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2)$ & $\dn$ &
		$_{bs}\sum ((\map \; (\fuse \; bs')\; rs_1) @ rs_2)$\\
		$\textit{simp}\_{SL}  \; _{bs}\sum[r]$ & $\dn$ & $\fuse \; bs \; (\textit{simp}\_{SL}  \; r)$\\
		$\textit{simp}\_{SL}  \; _{bs}\sum(r::rs)$ & $\dn$ & $_{bs}\sum 
		(\nub \; (\filter \; (\neg\zeroable)\;((\textit{simp}\_{SL}  \; r) :: \map \; \textit{simp}\_{SL}  \; rs)))$\\ 
		
	\end{tabular}
\end{center}
\noindent
The $\textit{zeroable}$ predicate 
tests whether the regular expression
is equivalent to $\ZERO$, and
can be defined as:
\begin{center}
	\begin{tabular}{lcl}
		$\zeroable \; _{bs}\sum (r::rs)$ & $\dn$ & $\zeroable \; r\;\; \land \;\;
		\zeroable \;_{[]}\sum\;rs $\\
		$\zeroable\;_{bs}(r_1 \cdot r_2)$ & $\dn$ & $\zeroable\; r_1 \;\; \lor \;\; \zeroable \; r_2$\\
		$\zeroable\;_{bs}r^*$ & $\dn$ & $\textit{false}$ \\
		$\zeroable\;_{bs}c$ & $\dn$ & $\textit{false}$\\
		$\zeroable\;_{bs}\ONE$ & $\dn$ & $\textit{false}$\\
		$\zeroable\;_{bs}\ZERO$ & $\dn$ & $\textit{true}$
	\end{tabular}
\end{center}
\noindent
The 
\begin{center}
	\begin{tabular}{lcl}
		$\textit{simp}\_{SL}  \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2)$ & $\dn$ &
		$_{bs}\sum ((\map \; (\fuse \; bs')\; rs_1) @ rs_2)$\\
	\end{tabular}
\end{center}
\noindent
clause does flatten the alternative as required in step (1),
but $\textit{simp}\_{SL}$ is insufficient if we want to do steps (2) and (3),
as these ``identical'' terms have different bit-annotations.
They also suggested that the $\textit{simp}\_{SL} $ function should be
applied repeatedly until a fixpoint is reached.
We call this construction $\textit{SLSimp}$:
\begin{center}
	\begin{tabular}{lcl}
		$\textit{SLSimp} \; r$ & $\dn$ & 
		$\textit{while}((\textit{simp}\_{SL}  \; r)\; \cancel{=} \; r)$ \\
					 & & $\quad r := \textit{simp}\_{SL}  \; r$\\
		& & $\textit{return} \; r$
	\end{tabular}
\end{center}
We call the operation of alternatingly 
applying derivatives and simplifications
(until the string is exhausted) Sulz-simp-derivative,
written $\backslash_{SLSimp}$:
\begin{center}
\begin{tabular}{lcl}
	$r \backslash_{SLSimp} (c\!::\!s) $ & $\dn$ & $(\textit{SLSimp} \; (r \backslash c)) \backslash_{SLSimp}\, s$ \\
$r \backslash_{SLSimp} [\,] $ & $\dn$ & $r$
\end{tabular}
\end{center}
\noindent
After the derivatives have been taken, the bitcodes
are extracted and decoded in the same manner
as $\blexer$:
\begin{center}
\begin{tabular}{lcl}
  $\textit{blexer\_SLSimp}\;r\,s$ & $\dn$ &
      $\textit{let}\;a = (r^\uparrow)\backslash_{SLSimp}\, s\;\textit{in}$\\                
  & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
  & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
  & & $\;\;\textit{else}\;\textit{None}$
\end{tabular}
\end{center}
\noindent
We implemented this lexing algorithm in Scala, 
and found that the final derivative regular expression
size still grows exponentially (note the logarithmic scale):
\begin{figure}[H]
	\centering
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    ylabel={size},
    ymode = log,
    legend entries={Final Derivative Size},  
    legend pos=north west,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {SulzmannLuLexer.data};
\end{axis}
\end{tikzpicture} 
\caption{Lexing the regular expression $(a^*a^*)^*$ against strings of the form
$\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}
$ using Sulzmann and Lu's lexer}\label{SulzmannLuLexer}
\end{figure}
\noindent
At $n= 20$ we already get an out-of-memory error with Scala's normal 
JVM heap size settings.
In fact their simplification does not improve much over
the simple-minded simplifications we have shown in \ref{fig:BetterWaterloo}.
The time required also grows exponentially:
\begin{figure}[H]
	\centering
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    ylabel={time},
    %ymode = log,
    legend entries={time in secs},  
    legend pos=north west,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {SulzmannLuLexerTime.data};
\end{axis}
\end{tikzpicture} 
\caption{Lexing the regular expression $(a^*a^*)^*$ against strings of the form
$\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}
$ using Sulzmann and Lu's lexer}\label{SulzmannLuLexerTime}
\end{figure}
\noindent
which seems like a counterexample for 
Sulzmann and Lu's linear complexity claim
in their paper \cite{Sulzmann2014}:
\begin{quote}\it
``Linear-Time Complexity Claim \\It is easy to see that each call of one of the functions/operations:
simp, fuse, mkEpsBC and isPhi leads to subcalls whose number is bound by the size of the regular expression involved. We claim that thanks to aggressively applying simp this size remains finite. Hence, we can argue that the above mentioned functions/operations have constant time complexity which implies that we can incrementally compute bit-coded parse trees in linear time in the size of the input.'' 
\end{quote}
\noindent
The assumption that the size of the regular expressions
in the algorithm
would stay below a finite constant is not true, at least not in the
examples we considered.
The main reason behind this is that (i) Haskell's $\textit{nub}$
function requires identical annotations between two 
annotated regular expressions to qualify as duplicates,
and therefore cannot simplify cases like $_{SZZ}a^*+_{SZS}a^*$
even if both $a^*$ denote the same language, and
(ii) the ``flattening'' only applies to the head of the list
in the 
\begin{center}
	\begin{tabular}{lcl}

		$\textit{simp}\_{SL}  \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2)$ & $\dn$ &
		$_{bs}\sum ((\map \; (\fuse \; bs')\; rs_1) @ rs_2)$\\
	\end{tabular}
\end{center}
\noindent
clause, and therefore is not strong enough to simplify all
needed parts of the regular expression. Moreover,
the $\textit{simp}\_{SL}$ function is applied repeatedly
in each derivative step until a fixed point is reached, 
which makes the algorithm even more
unpredictable and inefficient.
%To not get ``caught off guard'' by
%these counterexamples,
%one needs to be more careful when designing the
%simplification function and making claims about them.

\section{Our $\textit{Simp}$ Function}
We will now introduce our own simplification function.
%by making a contrast with $\textit{simp}\_{SL}$.
We also describe
the ideas behind Sulzmann and Lu's $\textit{simp}\_{SL}$
algorithm 
and why it fails to achieve the desired effect of keeping the sizes of derivatives finitely bounded. 
In addition, our simplification function will come with a formal
correctness proof.
\subsection{Flattening Nested Alternatives}
The idea behind the clause
\begin{center}
	$\textit{simp}\_{SL}  \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2) \quad \dn \quad
	       _{bs}\sum ((\map \; (\fuse \; bs')\; rs_1) @ rs_2)$
\end{center}
is that it allows
duplicate removal of regular expressions at different
``levels'' of alternatives.
For example, this would help with the
following simplification:

\begin{center}
$(a+r)+r \longrightarrow a+r$
\end{center}
The problem is that only the head element
is ``spilled out''.
It is more desirable
to flatten
an entire list to open up possibilities for further simplifications
with later regular expressions.
Not flattening the rest of the elements also means that
the later de-duplication process 
does not fully remove further duplicates.
For example,
using $\textit{simp}\_{SL}$ we cannot
simplify
\begin{center}
	$((a^* a^*)+\underline{(a^* + a^*)})\cdot (a^*a^*)^*+
((a^*a^*)+a^*)\cdot (a^*a^*)^*$
\end{center}
due to the underlined part not being the head 
of the alternative.

We define our flatten operation so that it flattens 
the entire list: 
 \begin{center}
  \begin{tabular}{@{}lcl@{}}
  $\textit{flts} \; (_{bs}\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $(\textit{map} \;
     (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flts} \; as' $ \\
  $\textit{flts} \; \ZERO :: as'$ & $\dn$ & $ \textit{flts} \;  \textit{as'} $ \\
    $\textit{flts} \; a :: as'$ & $\dn$ & $a :: \textit{flts} \; \textit{as'}$ \quad(otherwise) 
\end{tabular}    
\end{center}  
\noindent
Our $\flts$ operation 
also throws away $\ZERO$s
as they do not contribute to a lexing result.
\subsection{Duplicate Removal}
After flattening is done, we can deduplicate.
The de-duplicate function is called $\distinctBy$,
and that is where we make our second improvement over
Sulzmann and Lu's simplification method.
The process goes as follows:
\begin{center}
$rs \stackrel{\textit{flts}}{\longrightarrow} 
rs_{flat} 
\xrightarrow{\distinctBy \; 
rs_{flat} \; \rerases\; \varnothing} rs_{distinct}$
%\stackrel{\distinctBy \; 
%rs_{flat} \; \erase\; \varnothing}{\longrightarrow} \; rs_{distinct}$
\end{center}
where the $\distinctBy$ function is defined as:
\begin{center}
	\begin{tabular}{@{}lcl@{}}
		$\distinctBy \; [] \; f\; acc $ & $ =$ & $ []$\\
		$\distinctBy \; (x :: xs) \; f \; acc$ & $=$ & $\quad \textit{if} (f \; x \in acc)\;\; \textit{then} \;\; \distinctBy \; xs \; f \; acc$\\
						       & & $\quad \textit{else}\;\; x :: (\distinctBy \; xs \; f \; (\{f \; x\} \cup acc))$ 
	\end{tabular}
\end{center}
\noindent
The reason we define a distinct function under a mapping $f$ is because
we want to eliminate regular expressions that are syntactically the same,
but have different bit-codes.
For example, we can remove the second $a^*a^*$ from
$_{ZSZ}a^*a^* + _{SZZ}a^*a^*$, because it
represents a match with shorter initial sub-match 
(and therefore is definitely not POSIX),
and will be discarded by $\bmkeps$ later.
\begin{center}
	$_{ZSZ}\underbrace{a^*}_{ZS:\; match \; 1\; times\quad}\underbrace{a^*}_{Z: \;match\; 1 \;times} + 
	_{SZZ}\underbrace{a^*}_{S: \; match \; 0 \; times\quad}\underbrace{a^*}_{ZZ: \; match \; 2 \; times}
	$
\end{center}
%$_{bs1} r_1 + _{bs2} r_2 \text{where} (r_1)_{\downarrow} = (r_2)_{\downarrow}$
Due to the way our algorithm works,
the matches that conform to the POSIX standard 
will always be placed further to the left. When we 
traverse the list from left to right,
regular expressions we have already seen
will definitely not contribute to a POSIX value,
even if they are attached with different bitcodes.
These duplicates therefore need to be removed.
To achieve this, we call $\rerases$ as the function $f$ during the distinction
operation. The function
$\rerases$ is very similar to $\erase$, except that it preserves the structure
when erasing an alternative regular expression.
The reason why we use $\rerases$ instead of $\erase$ is that
it keeps the structures of alternative 
annotated regular expressions
whereas $\erase$ would turn it back into a binary  tree structure.
Not having to mess with the structure 
greatly simplifies the finiteness proof in chapter 
\ref{Finite}.
We give the definitions of $\rerases$ here together with
the new datatype used by $\rerases$ (as our plain
regular expression datatype does not allow non-binary alternatives).
For now we can think of 
$\rerases$ as the function $(\_)_\downarrow$ defined in chapter \ref{Bitcoded1}
and $\rrexp$ as plain regular expressions, but having a general list constructor
for alternatives:
\begin{figure}[H]
\begin{center}	
	$\rrexp ::=   \RZERO \mid  \RONE
			 \mid  \RCHAR{c}  
			 \mid  \RSEQ{r_1}{r_2}
			 \mid  \RALTS{rs}
			 \mid \RSTAR{r}        $
\end{center}
\caption{$\rrexp$: plain regular expressions, but with $\sum$ alternative 
constructor}\label{rrexpDef}
\end{figure}
The function $\rerases$ we define as follows:
\begin{center}
\begin{tabular}{lcl}
$\rerase{\ZERO}$ & $\dn$ & $\RZERO$\\
$\rerase{_{bs}\ONE}$ & $\dn$ & $\RONE$\\
	$\rerase{_{bs}\mathbf{c}}$ & $\dn$ & $\RCHAR{c}$\\
$\rerase{_{bs}r_1\cdot r_2}$ & $\dn$ & $\RSEQ{\rerase{r_1}}{\rerase{r_2}}$\\
$\rerase{_{bs}\sum as}$ & $\dn$ & $\RALTS{\map \; \rerase{\_} \; as}$\\
$\rerase{_{bs} a ^*}$ & $\dn$ & $\rerase{a}^*$
\end{tabular}
\end{center}

\subsection{Putting Things Together}
We can now give the definition of our  simplification function:
%that looks somewhat similar to our Scala code is 
\begin{center}
  \begin{tabular}{@{}lcl@{}}
   
	  $\textit{bsimp} \; (_{bs}a_1\cdot a_2)$ & $\dn$ & $ \textit{bsimp}_{ASEQ} \; bs \;(\textit{bsimp} \; a_1) \; (\textit{bsimp}  \; a_2)  $ \\
	  $\textit{bsimp} \; (_{bs}\sum \textit{as})$ & $\dn$ & $\textit{bsimp}_{ALTS} \; \textit{bs} \; (\textit{distinctBy} \; ( \textit{flatten} ( \textit{map} \; bsimp \; as)) \; \rerases \; \varnothing) $ \\
   $\textit{bsimp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$   
\end{tabular}    
\end{center}    

\noindent
The simplification (named $\textit{bsimp}$ for \emph{b}it-coded) 
does a pattern matching on the regular expression.
When it detects that the regular expression is an alternative or
sequence, it will try to simplify its children regular expressions
recursively and then see if one of the children turns into $\ZERO$ or
$\ONE$, which might trigger further simplification at the current level.
Current level simplifications are handled by the function $\textit{bsimp}_{ASEQ}$,
using rules such as  $\ZERO \cdot r \rightarrow \ZERO$ and $\ONE \cdot r \rightarrow r$.
\begin{center}
	\begin{tabular}{@{}lcl@{}}
		$\textit{bsimp}_{ASEQ} \; bs\; a \; b$ & $\dn$ & $ (a,\; b) \textit{match}$\\
   &&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow  \ZERO$ \\
   &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow  \ZERO$ \\
   &&$\quad\textit{case} \;  (_{bs1}\ONE, a_2') \Rightarrow  \textit{fuse} \; (bs@bs_1) \;  a_2'$ \\
   &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow   _{bs}a_1' \cdot a_2'$ 
	\end{tabular}
\end{center}
\noindent
The most involved part is the $\sum$ clause, where we first call $\flts$ on
the simplified children regular expression list $\textit{map}\; \textit{bsimp}\; \textit{as}$,
and then call $\distinctBy$ on that list. The predicate used in $\distinctBy$ for determining whether two 
elements are the same is $\rerases \; r_1 = \rerases\; r_2$.
Finally, depending on whether the regular expression list $as'$ has turned into a
singleton or empty list after $\flts$ and $\distinctBy$, $\textit{bsimp}_{ALTS}$
decides whether to keep the current level constructor $\sum$ as it is, and 
removes it when there are fewer than two elements:
\begin{center}
	\begin{tabular}{lcl}
		$\textit{bsimp}_{ALTS} \; bs \; as'$ & $ \dn$ & $ as' \; \textit{match}$\\		
  &&$\quad\textit{case} \; [] \Rightarrow  \ZERO$ \\
   &&$\quad\textit{case} \; a :: [] \Rightarrow  \textit{fuse bs a}$ \\
   &&$\quad\textit{case} \;  as' \Rightarrow _{bs}\sum \textit{as'}$\\ 
	\end{tabular}
	
\end{center}
Having defined the $\textit{bsimp}$ function,
we add it as a phase after a derivative is taken.
\begin{center}
	\begin{tabular}{lcl}
		$r \backslash_{bsimp} c$ & $\dn$ & $\textit{bsimp}(r \backslash c)$
	\end{tabular}
\end{center}
%Following previous notations
%when extending from derivatives w.r.t.~character to derivative
%w.r.t.~string, we define the derivative that nests simplifications 
%with derivatives:%\comment{simp in  the [] case?}
We extend this from characters to strings:
\begin{center}
\begin{tabular}{lcl}
$r \backslash_{bsimps} (c\!::\!s) $ & $\dn$ & $(r \backslash_{bsimp}\, c) \backslash_{bsimps}\, s$ \\
$r \backslash_{bsimps} [\,] $ & $\dn$ & $r$
\end{tabular}
\end{center}

\noindent
The lexer that extracts bitcodes from the 
derivatives with simplifications from our $\simp$ function
is called $\blexersimp$:
\begin{center}
\begin{tabular}{lcl}
  $\textit{blexer\_simp}\;r\,s$ & $\dn$ &
      $\textit{let}\;a = (r^\uparrow)\backslash_{bsimp}\, s\;\textit{in}$\\                
  & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
  & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
  & & $\;\;\textit{else}\;\textit{None}$
\end{tabular}
\end{center}
\noindent
This algorithm keeps the regular expression size small, 
as we shall demonstrate with some examples in the next section.


\subsection{Examples $(a+aa)^*$ and $(a^*\cdot a^*)^*$
After Simplification}
Recall the
previous $(a^*a^*)^*$ example
where $\textit{simp}\_{SL}$ could not
prevent the fast growth (over
3 million nodes just below $20$ input length)
will be reduced to just 15 and stays constant no matter how long the
input string is.
This is shown in the graphs below.
\begin{figure}[H]
\begin{center}
\begin{tabular}{ll}
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    ylabel={derivative size},
        width=7cm,
    height=4cm, 
    legend entries={Lexer with $\textit{bsimp}$},  
    legend pos=  south east,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {BitcodedLexer.data};
\end{axis}
\end{tikzpicture} %\label{fig:BitcodedLexer}
&
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    ylabel={derivative size},
    width = 7cm,
    height = 4cm,
    legend entries={Lexer with $\textit{simp}\_{SL}$},  
    legend pos=  north west,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {BetterWaterloo.data};
\end{axis}
\end{tikzpicture} 
\end{tabular}
\end{center}
\caption{Our Improvement over Sulzmann and Lu's in terms of size of the derivatives.}
\end{figure}
\noindent
Given the size difference, it is not
surprising that our $\blexersimp$ significantly outperforms
$\textit{blexer\_SLSimp}$ by Sulzmann and Lu.
In the next section we are going to establish that our
simplification preserves the correctness of the algorithm.
%----------------------------------------------------------------------------------------
%	SECTION rewrite relation
%----------------------------------------------------------------------------------------
\section{Correctness of $\blexersimp$}
We first introduce the rewriting relation \emph{rrewrite}
($\rrewrite$) between two regular expressions,
which stands for an atomic
simplification.
We then prove properties about
this rewriting relation and its reflexive transitive closure.
Finally we leverage these properties to show
an equivalence between the results generated by
$\blexer$ and $\blexersimp$.

\subsection{The Rewriting Relation $\rrewrite$($\rightsquigarrow$)}
In the $\blexer$'s correctness proof, we
did not directly derive the fact that $\blexer$ generates the POSIX value,
but first proved that $\blexer$ generates the same result as $\lexer$.
Then we re-use
the correctness of $\lexer$
to obtain 
\begin{center}
	$(r, s) \rightarrow v \;\; \textit{iff} \;\; \blexer \; r \;s = v$\\
	$\nexists v. \; (r, s) \rightarrow v \;\; \textit{iff} \;\; \blexer\;
	r\;s = \None$.
\end{center}
%\begin{center}
%	$(r, s) \rightarrow v \;\; \textit{iff} \;\; \blexer \; r \;s = v$.
%\end{center}
Here we apply this
modularised technique again
by first proving that
$\blexersimp \; r \; s $ 
produces the same output as $\blexer \; r\; s$,
and then piecing it together with 
$\blexer$'s correctness to achieve our main
theorem:
\begin{center}
	$(r, s) \rightarrow v \; \;   \textit{iff} \;\;  \blexersimp \; r \; s = \Some \;v$
	\\
	$\nexists v. \; (r, s) \rightarrow v \;\; \textit{iff} \;\; \blexersimp\;
	r\;s = \None$
\end{center}
\noindent
The overall idea for the proof
of $\blexer \;r \;s = \blexersimp \; r \;s$ 
is that the transition from $r$ to $\textit{bsimp}\; r$ can be
broken down into smaller rewrite steps of the form:
\begin{center}
	$r \rightsquigarrow^* \textit{bsimp} \; r$
\end{center}
where each rewrite step, written $\rightsquigarrow$,
is an ``atomic'' simplification that
is similar to a small-step reduction in operational semantics (
see figure \ref{rrewriteRules} for the rules):
\begin{figure}[H]
\begin{mathpar}
	\inferrule * [Right = $S\ZERO_l$]{\vspace{0em}}{_{bs} \ZERO \cdot r_2 \rightsquigarrow \ZERO\\}

	\inferrule * [Right = $S\ZERO_r$]{\vspace{0em}}{_{bs} r_1 \cdot \ZERO \rightsquigarrow \ZERO\\}

	\inferrule * [Right = $S_1$]{\vspace{0em}}{_{bs1} ((_{bs2} \ONE) \cdot r) \rightsquigarrow \fuse \; (bs_1 @ bs_2) \; r\\}\\

	
	
	\inferrule * [Right = $SL$] {\\ r_1 \rightsquigarrow r_2}{_{bs} r_1 \cdot r_3 \rightsquigarrow _{bs} r_2 \cdot r_3\\}

	\inferrule * [Right = $SR$] {\\ r_3 \rightsquigarrow r_4}{_{bs} r_1 \cdot r_3 \rightsquigarrow _{bs} r_1 \cdot r_4\\}\\

	\inferrule * [Right = $A0$] {\vspace{0em}}{ _{bs}\sum [] \rightsquigarrow \ZERO}

	\inferrule * [Right = $A1$] {\vspace{0em}}{ _{bs}\sum [a] \rightsquigarrow \fuse \; bs \; a}

	\inferrule * [Right = $AL$] {\\ rs_1 \stackrel{s}{\rightsquigarrow} rs_2}{_{bs}\sum rs_1 \rightsquigarrow rs_2}

	\inferrule * [Right = $LE$] {\vspace{0em}}{ [] \stackrel{s}{\rightsquigarrow} []}

	\inferrule * [Right = $LT$] {rs_1 \stackrel{s}{\rightsquigarrow} rs_2}{ r :: rs_1 \stackrel{s}{\rightsquigarrow} r :: rs_2 }

	\inferrule * [Right = $LH$] {r_1 \rightsquigarrow r_2}{ r_1 :: rs \stackrel{s}{\rightsquigarrow} r_2 :: rs}

	\inferrule * [Right = $L\ZERO$] {\vspace{0em}}{\ZERO :: rs \stackrel{s}{\rightsquigarrow} rs}

	\inferrule * [Right = $LS$] {\vspace{0em}}{_{bs} \sum (rs_1 :: rs_b) \stackrel{s}{\rightsquigarrow} ((\map \; (\fuse \; bs_1) \; rs_1) @ rsb) }

	\inferrule * [Right = $LD$] {\\ \rerase{a_1} = \rerase{a_2}}{rs_a @ [a_1] @ rs_b @ [a_2] @ rs_c \stackrel{s}{\rightsquigarrow} rs_a @ [a_1] @ rs_b @ rs_c}

\end{mathpar}
\caption{
The rewrite rules that generate simplified regular expressions 
in small steps: $r_1 \rightsquigarrow r_2$ is for bitcoded regular expressions 
and $rs_1 \stackrel{s}{\rightsquigarrow} rs_2$ for 
lists of bitcoded regular expressions. 
Interesting is the LD rule that allows copies of regular expressions 
to be removed provided a regular expression 
earlier in the list can match the same strings.
}\label{rrewriteRules}
\end{figure}
\noindent
The rules $LT$ and $LH$ are for rewriting two regular expression lists
such that one regular expression
in the left-hand-side list is rewritable in one step
to the right-hand side's regular expression at the same position.
This helps with defining the ``context rule'' $AL$.

The reflexive transitive closure of $\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$
are defined in the usual way:
\begin{figure}[H]
	\centering
\begin{mathpar}
	\inferrule{\vspace{0em}}{ r \rightsquigarrow^* r \\}

	\inferrule{\vspace{0em}}{rs \stackrel{s*}{\rightsquigarrow} rs \\}

	\inferrule{r_1 \rightsquigarrow^*  r_2 \land \; r_2 \rightsquigarrow^* r_3}{r_1 \rightsquigarrow^* r_3\\}
	
	\inferrule{rs_1 \stackrel{s*}{\rightsquigarrow}  rs_2 \land \; rs_2 \stackrel{s*}{\rightsquigarrow} rs_3}{rs_1 \stackrel{s*}{\rightsquigarrow} rs_3}
\end{mathpar}
\caption{The Reflexive Transitive Closure of 
$\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$}\label{transClosure}
\end{figure}
%Two rewritable terms will remain rewritable to each other
%even after a derivative is taken:
The main point of our rewriting relation
is that it is preserved under derivatives,
namely
\begin{center}
	$r_1 \rightsquigarrow r_2 \implies (r_1 \backslash c) \rightsquigarrow^* (r_2 \backslash c)$
\end{center}
And also, if two terms are rewritable to each other,
then they produce the same bitcodes:
\begin{center}
	$r \rightsquigarrow^* r' \;\; \textit{then} \; \; \bmkeps \; r = \bmkeps \; r'$
\end{center}
The decoding phase of both $\blexer$ and $\blexersimp$
are the same, which means that if they receive the same
bitcodes before the decoding phase,
they generate the same value after decoding is done.
We will prove the three properties 
we mentioned above in the next sub-section.
\subsection{Important Properties of $\rightsquigarrow$}
First we prove some basic facts 
about $\rightsquigarrow$, $\stackrel{s}{\rightsquigarrow}$, 
$\rightsquigarrow^*$ and $\stackrel{s*}{\rightsquigarrow}$,
which will be needed later.\\
The inference rules (\ref{rrewriteRules}) we 
gave in the previous section 
have their ``many-steps version'':

\begin{lemma}\label{squig1}
	\hspace{0em}
	\begin{itemize}
		\item
			$rs_1 \stackrel{s*}{\rightsquigarrow} rs_2 \implies _{bs} \sum rs_1 \stackrel{*}{\rightsquigarrow} _{bs} \sum rs_2$
		\item
			$r \rightsquigarrow^* r' \implies _{bs} \sum (r :: rs)\; \rightsquigarrow^*\;  _{bs} \sum (r' :: rs)$

		\item
			The rewriting in many steps property is composable 
			in terms of the sequence constructor:\\
			$r_1 \rightsquigarrow^* r_2 
			\implies _{bs} r_1 \cdot r_3 \rightsquigarrow^* \;  
			_{bs} r_2 \cdot r_3 \quad $ 
			and 
			$\quad r_3 \rightsquigarrow^* r_4 
			\implies _{bs} r_1 \cdot r_3 \rightsquigarrow^* _{bs} \; r_1 \cdot r_4$
		\item
			The rewriting in many steps properties 
			$\stackrel{*}{\rightsquigarrow}$ and $\stackrel{s*}{\rightsquigarrow}$ 
			is preserved under the function $\fuse$:\\
				$r_1 \rightsquigarrow^* r_2 
				\implies \fuse \; bs \; r_1 \rightsquigarrow^* \; 
				\fuse \; bs \; r_2 \quad  $ and 
				$rs_1 \stackrel{s}{\rightsquigarrow} rs_2 
				\implies \map \; (\fuse \; bs) \; rs_1 
				\stackrel{s*}{\rightsquigarrow} \map \; (\fuse \; bs) \; rs_2$
	\end{itemize}
\end{lemma}
\begin{proof}
	By an induction on 
	the inductive cases of $\stackrel{s*}{\rightsquigarrow}$ and $\rightsquigarrow^*$ respectively.
	The third and fourth points are 
	by the properties $r_1 \rightsquigarrow r_2 \implies \fuse \; bs \; r_1 \implies \fuse \; bs \; r_2$ and
	$rs_2 \stackrel{s}{\rightsquigarrow} rs_3 
	\implies \map \; (\fuse \; bs) rs_2 \stackrel{s*}{\rightsquigarrow} \map \; (\fuse \; bs)\; rs_3$,
	which can be inductively proven by the inductive cases of $\rightsquigarrow$ and 
	$\stackrel{s}{\rightsquigarrow}$.
\end{proof}
\noindent
The inference rules of $\stackrel{s}{\rightsquigarrow}$
are defined in terms of the list cons operation, where
we establish that the 
$\stackrel{s}{\rightsquigarrow}$ and $\stackrel{s*}{\rightsquigarrow}$ 
relation is also preserved w.r.t appending and prepending of a list.
In addition, we
also prove some relations 
between $\rightsquigarrow^*$ and $\stackrel{s*}{\rightsquigarrow}$.
\begin{lemma}\label{ssgqTossgs}
	\hspace{0em}
	\begin{itemize}
		\item
			$rs_1 \stackrel{s}{\rightsquigarrow} rs_2 \implies rs @ rs_1 \stackrel{s}{\rightsquigarrow} rs @ rs_2$

		\item
			$rs_1 \stackrel{s*}{\rightsquigarrow} rs_2 \implies 
			rs @ rs_1 \stackrel{s*}{\rightsquigarrow} rs @ rs_2 \; \;
			\textit{and} \; \;
			rs_1 @ rs \stackrel{s*}{\rightsquigarrow} rs_2 @ rs$
			
		\item
			The $\stackrel{s}{\rightsquigarrow} $ relation after appending 
			a list becomes $\stackrel{s*}{\rightsquigarrow}$:\\
			$rs_1 \stackrel{s}{\rightsquigarrow} rs_2 
			\implies rs_1 @ rs \stackrel{s*}{\rightsquigarrow} rs_2 @ rs$
		\item
		
			$r_1 \rightsquigarrow^* r_2 \implies [r_1] \stackrel{s*}{\rightsquigarrow} [r_2]$
		\item
		
			$rs_3 \stackrel{s*}{\rightsquigarrow} rs_4 \land r_1 \rightsquigarrow^* r_2 \implies
			r_2 :: rs_3 \stackrel{s*}{\rightsquigarrow} r_2 :: rs_4$
		\item			
			If we can rewrite a regular expression 
			in many steps to $\ZERO$, then 
			we can also rewrite any sequence containing it to $\ZERO$:\\
			$r_1 \rightsquigarrow^* \ZERO 
			\implies _{bs}r_1\cdot r_2 \rightsquigarrow^* \ZERO$
	\end{itemize}
\end{lemma}
\begin{proof}
	The first part is by induction on the list $rs$.
	The second part is by induction on the inductive cases of $\stackrel{s*}{\rightsquigarrow}$.
	The third part is 
	by rule induction of $\stackrel{s}{\rightsquigarrow}$.
	The fourth sub-lemma is 
	by rule induction of 
	$\stackrel{s*}{\rightsquigarrow}$ and using part one to three. 
	The fifth part is a corollary of part four.
	The last part is proven by rule induction again on $\rightsquigarrow^*$.
\end{proof}
\noindent
Now we are ready to give the proofs of the following properties:
\begin{itemize}
	\item
		$r \rightsquigarrow^* r'\land \bnullable \; r_1 
		\implies \bmkeps \; r = \bmkeps \; r'$. \\
	\item
		$r \rightsquigarrow^* \textit{bsimp} \;r$.\\
	\item
		$r \rightsquigarrow r' \implies r \backslash c \rightsquigarrow^* r'\backslash c$.\\
\end{itemize}

\subsubsection{Property 1: $r \rightsquigarrow^* r'\land \bnullable \; r_1 
		\implies \bmkeps \; r = \bmkeps \; r'$}
Intuitively, this property says we can 
extract the same bitcodes using $\bmkeps$ from the nullable
components of two regular expressions $r$ and $r'$,
if we can rewrite from one to the other in finitely
many steps.

For convenience, 
we define a predicate for a list of regular expressions
having at least one nullable regular expression:
\begin{center}
	$\textit{bnullables} \; rs \quad \dn \quad \exists r \in rs. \;\; \bnullable \; r$
\end{center}
\noindent
The rewriting relation $\rightsquigarrow$ preserves (b)nullability:
\begin{lemma}\label{rewritesBnullable}
	\hspace{0em}
	\begin{itemize}
		\item
			$\text{If} \; r_1 \rightsquigarrow r_2, \; 
			\text{then} \; \bnullable \; r_1 = \bnullable \; r_2$
		\item 	
			$\text{If} \; rs_1 \stackrel{s}{\rightsquigarrow} rs_2 \;
			\text{then} \; \textit{bnullables} \; rs_1 = \textit{bnullables} \; rs_2$
		\item
			$r_1 \rightsquigarrow^* r_2 
			\implies \bnullable \; r_1 = \bnullable \; r_2$
	\end{itemize}
\end{lemma}
\begin{proof}
	By rule induction of $\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$.
	The third point is a result of the second.
\end{proof}
\noindent
For convenience again,
we define $\bmkepss$ on a list $rs$,
which extracts the bit-codes on the first $\bnullable$ element in $rs$:
\begin{center}
	\begin{tabular}{lcl}
		$\bmkepss \; [] $ & $\dn$ & $[]$\\
		$\bmkepss \; r :: rs$ & $\dn$ & $\textit{if} \;(\bnullable \; r) \;\; 
		\textit{then} \;\; \bmkeps \; r \; \textit{else} \;\; \bmkepss \; rs$
	\end{tabular}
\end{center}
\noindent
If both regular expressions in a rewriting relation are nullable, then they 
produce the same bitcodes:
\begin{lemma}\label{rewriteBmkepsAux}
	\hspace{0em}
	\begin{itemize}
		\item
			$r_1 \rightsquigarrow r_2 \implies 
			(\bnullable \; r_1 \land \bnullable \; r_2 \implies \bmkeps \; r_1 = 
			\bmkeps \; r_2)$ 
		\item
			and
			$rs_ 1 \stackrel{s}{\rightsquigarrow} rs_2 
			\implies (\bnullables \; rs_1 \land \bnullables \; rs_2 \implies 
			\bmkepss \; rs_1 = \bmkepss \; rs2)$
	\end{itemize}
\end{lemma}
\begin{proof}
	By rule induction over the cases that lead to $r_1 \rightsquigarrow r_2$.
\end{proof}
\noindent
With lemma \ref{rewriteBmkepsAux} in place we are ready to prove its
many-step version: 
\begin{lemma}
	$\text{If} \;\; r \stackrel{*}{\rightsquigarrow} r' \;\; \text{and} \;\; \bnullable \; r, \;\;\; \text{then} \;\; \bmkeps \; r = \bmkeps \; r'$
\end{lemma}
\begin{proof}
	By rule induction of $\stackrel{*}{\rightsquigarrow} $. Lemma 
	$\ref{rewritesBnullable}$ gives us both $r$ and $r'$ are nullable.
	The lemma \ref{rewriteBmkepsAux} solves the inductive case.
\end{proof}

\subsubsection{Property 2: $r \stackrel{*}{\rightsquigarrow} \textit{bsimp} \; r$}
Now we get to the key part of the proof, 
which says that our simplification's helper functions 
such as $\distinctBy$ and $\flts$ describe
reducts of $\stackrel{s*}{\rightsquigarrow}$ and 
$\rightsquigarrow^* $.

The first lemma to prove is a more general version of 
$rs_ 1 \rightsquigarrow^* \distinctBy \; rs_1 \; \phi$:
\begin{lemma}
	$rs_1 @ rs_2 \stackrel{s*}{\rightsquigarrow} 
	(rs_1 @ (\distinctBy \; rs_2 \; \; \rerases \;\; (\map\;\; \rerases \; \; rs_1)))$
\end{lemma}
\noindent
It says that for a list made of two parts $rs_1 @ rs_2$, 
one can throw away the duplicate
elements in $rs_2$, as well as those that have appeared in $rs_1$.
\begin{proof}
	By induction on $rs_2$, where $rs_1$ is allowed to be arbitrary.
\end{proof}
\noindent
Setting $rs_2$ to be empty,
we get the corollary
\begin{corollary}\label{dBPreserves}
	$rs_1 \stackrel{s*}{\rightsquigarrow} \distinctBy \; rs_1 \; \phi$.
\end{corollary}
\noindent
Similarly the flatten function $\flts$ describes a reduct of
$\stackrel{s*}{\rightsquigarrow}$ as well:

\begin{lemma}\label{fltsPreserves}
	$rs \stackrel{s*}{\rightsquigarrow} \flts \; rs$
\end{lemma}
\begin{proof}
	By an induction on $rs$.
\end{proof}
\noindent
The function $\bsimpalts$ preserves rewritability:
\begin{lemma}\label{bsimpaltsPreserves}
	$_{bs} \sum rs \stackrel{*}{\rightsquigarrow} \bsimpalts \; _{bs} \; rs$
\end{lemma}
\noindent
The simplification function
$\textit{bsimp}$ only transforms the regular expression  using steps specified by 
$\rightsquigarrow^*$ and nothing else:
\begin{lemma}
	$r \stackrel{*}{\rightsquigarrow} \textit{bsimp} \; r$
\end{lemma}
\begin{proof}
	By an induction on $r$.
	The most involved case is the alternative, 
	where we use lemmas \ref{bsimpaltsPreserves},
	\ref{fltsPreserves} and \ref{dBPreserves} to do a series of rewriting:\\
	\begin{center}
		\begin{tabular}{lcl}
			$rs$ &  $\stackrel{s*}{\rightsquigarrow}$ & $ \map \; \textit{bsimp} \; rs$\\
			     &  $\stackrel{s*}{\rightsquigarrow}$ & $ \flts \; (\map \; \textit{bsimp} \; rs)$\\
			     &  $\stackrel{s*}{\rightsquigarrow}$ & $ \distinctBy \; 
			(\flts \; (\map \; \textit{bsimp}\; rs)) \; \rerases \; \phi$\\
		\end{tabular}
	\end{center}
	Using this we can derive the following rewrite sequence:\\
	\begin{center}
		\begin{tabular}{lcl}
			$r$ & $=$ & $_{bs}\sum rs$\\[1.5ex]
			    & $\rightsquigarrow^*$ & $\bsimpalts \; bs \; rs$ \\[1.5ex]
			    & $\rightsquigarrow^*$ & $\ldots$ \\ [1.5ex]
			    & $\rightsquigarrow^*$ & $\bsimpalts \; bs \; 
			    (\distinctBy \; (\flts \; (\map \; \textit{bsimp}\; rs)) 
			    \; \rerases \; \phi)$\\[1.5ex]
			    %& $\rightsquigarrow^*$ & $ _{bs} \sum (\distinctBy \; 
				%(\flts \; (\map \; \textit{bsimp}\; rs)) \; \;
				%\rerases \; \;\phi) $\\[1.5ex]
			    & $\rightsquigarrow^*$ & $\textit{bsimp} \; r$\\[1.5ex]
		\end{tabular}
	\end{center}	
\end{proof}
\subsubsection{Property 3: $r_1 \stackrel{*}{\rightsquigarrow}  r_2 \implies r_1 \backslash c \stackrel{*}{\rightsquigarrow} r_2 \backslash c$}
The rewrite relation 
$\rightsquigarrow$ changes into $\stackrel{*}{\rightsquigarrow}$
after derivatives are taken on both sides:
\begin{lemma}\label{rewriteBder}
	\hspace{0em}
	\begin{itemize}
		\item
			If $r_1 \rightsquigarrow r_2$, then $r_1 \backslash c 
			\rightsquigarrow^*  r_2 \backslash c$ 
		\item	
			If $rs_1 \stackrel{s}{\rightsquigarrow} rs_2$, then $ 
			\map \; (\_\backslash c) \; rs_1 
			\stackrel{s*}{\rightsquigarrow} \map \; (\_ \backslash c) \; rs_2$
	\end{itemize}
\end{lemma}
\begin{proof}
	By induction on $\rightsquigarrow$ 
	and $\stackrel{s}{\rightsquigarrow}$, using a number of the previous lemmas.
\end{proof}
\noindent
Now we can prove property 3 as an immediate corollary:
\begin{corollary}\label{rewritesBder}
	$r_1 \rightsquigarrow^* r_2 \implies r_1 \backslash c \rightsquigarrow^*   
	r_2 \backslash c$
\end{corollary}
\begin{proof}
	By rule induction of $\stackrel{*}{\rightsquigarrow} $ and   lemma \ref{rewriteBder}.
\end{proof}
\noindent
This can be extended and combined with $r \rightsquigarrow^* \textit{bsimp} \; r$
to obtain the correspondence between
$\blexer$ and $\blexersimp$'s intermediate
derivative regular expressions 
\begin{lemma}\label{bderBderssimp}
	$a \backslash s \rightsquigarrow^* \bderssimp{a}{s} $
\end{lemma}
\begin{proof}
	By an induction on $s$.
\end{proof}
\subsection{Main Theorem}
Now with \ref{bderBderssimp} in place we are ready for the main theorem.
\begin{theorem}
	$\blexer \; r \; s = \blexersimp{r}{s}$
\end{theorem}
\noindent
\begin{proof}
	We can rewrite in many steps from the original lexer's 
	derivative regular expressions to the 
	lexer with simplification applied (by lemma \ref{bderBderssimp}):
	\begin{center}
		$a \backslash s \rightsquigarrow^* \bderssimp{a}{s} $.
	\end{center}
	We know that they generate the same bits, if the lexing result is a match:
	\begin{center}
		$\bnullable \; (a \backslash s) 
		\implies \bmkeps \; (a \backslash s) = \bmkeps \; (\bderssimp{a}{s})$
	\end{center}
	Now that they generate the same bits, we know they also give the same value after decoding.
	\begin{center}
		$\bnullable \; (a \backslash s) 
		\implies \decode \; r \; (\bmkeps \; (a \backslash s)) = 
		\decode \; r \; (\bmkeps \; (\bderssimp{a}{s}))$
	\end{center}
	Which is required by our proof goal:
	\begin{center}
		$\blexer \; r \; s = \blexersimp \; r \; s$.
	\end{center}	
\end{proof}
\noindent
As a corollary,
we can link this result with the lemma we proved earlier that 
\begin{center}
	$(r, s) \rightarrow v \;\; \textit{iff}\;\; \blexer \; r \; s = \Some \;v$\\
	$\nexists v. \; (r, s) \rightarrow v \;\; \textit{iff} \;\; \blexer\;
	r\;s = \None$.
\end{center}
and obtain the property that the bit-coded lexer with simplification is
indeed correctly generating a POSIX lexing result, if such a result exists.
\begin{corollary}
	$(r, s) \rightarrow v \;\; \textit{iff} \;\; \blexersimp \; r\; s = \Some \; v$\\
	$\nexists v. \; (r, s) \rightarrow v \;\; \textit{iff} \;\; \blexersimp\;
	r\;s = \None$.
\end{corollary}

\subsection{Comments on the Proof Techniques Used}
Straightforward as the proof may seem,
the efforts we spent obtaining it were far from trivial.
We initially attempted to re-use the argument 
in \cref{flex_retrieve}. 
The problem is that both functions $\inj$ and $\retrieve$ require 
that the annotated regular expressions stay unsimplified, 
so that one can 
correctly compare $v_{i+1}$ and $r_i$  and $v_i$ 
in diagram \ref{graph:inj}.

We also tried to prove 
\begin{center}
$\textit{bsimp} \;\; (\bderssimp{a}{s}) = 
\textit{bsimp} \;\;  (a\backslash s)$,
\end{center}
but this turns out to be not true.
A counterexample is
\[ a = [(_{Z}1+_{S}c)\cdot [bb \cdot (_{Z}1+_{S}c)]] \;\; 
	\text{and} \;\; s = bb.
\]
\noindent
Then we would have 
\begin{center}
	$\textit{bsimp}\;\; ( a \backslash s )$ =
	$_{[]}(_{ZZ}\ONE +  _{ZS}c ) $
\end{center}
\noindent
whereas 
\begin{center}
	$\textit{bsimp} \;\;( \bderssimp{a}{s} )$ =  
	$_{Z}(_{Z} \ONE + _{S} c)$.
\end{center}
Unfortunately, 
if we apply $\textit{bsimp}$ differently
we will always have this discrepancy. 
This is due to 
the $\map \; (\fuse\; bs) \; as$ operation 
happening at different locations in the regular expression.

The rewriting relation 
$\rightsquigarrow^*$ 
allows us to ignore this discrepancy
and view the expressions 
\begin{center}
	$_{[]}(_{ZZ}\ONE +  _{ZS}c ) $\\
	and\\
	$_{Z}(_{Z} \ONE + _{S} c)$

\end{center}
as equal because they were both re-written
from the same expression.

The simplification rewriting rules
given in \ref{rrewriteRules} are by no means
final,
one could come up with new rules
such as 
$\SEQ r_1 \cdot (\SEQ r_1 \cdot r_3) \rightarrow
\SEQs [r_1, r_2, r_3]$.
However this does not fit with the proof technique
of our main theorem, but seem to not violate the POSIX
property.

Having established the correctness of our
$\blexersimp$, in the next chapter we shall prove that with our $\simp$ function,
for a given $r$, the derivative size is always
finitely bounded by a constant.
author	Chengsong
	Sun, 18 Jun 2023 17:54:52 +0100
changeset 649	ef2b8abcbc55
parent 640	bd1354127574
child 650	a365d1364640
permissions	-rwxr-xr-x