lexing: ChengsongTanPhdThesis/Chapters/Finite.tex@61139fdddae0


% Chapter Template

\chapter{Finiteness Bound} % Main chapter title

\label{Finite} 
%  In Chapter 4 \ref{Chapter4} we give the second guarantee
%of our bitcoded algorithm, that is a finite bound on the size of any 
%regex's derivatives. 

In this chapter we give a guarantee in terms of size: 
given an annotated regular expression $a$, for any string $s$
our algorithm $\blexersimp$'s internal annotated regular expression 
size  is finitely bounded
by a constant $N_a$ that only depends on $a$:
\begin{center}
	$\llbracket \bderssimp{a}{s} \rrbracket \leq N_a$
\end{center}
\noindent
where the size of an annotated regular expression is defined
in terms of the number of nodes in its tree structure:
\begin{center}
	\begin{tabular}{ccc}
		$\llbracket _{bs}\ONE \rrbracket$ & $\dn$ & $1$\\
		$\llbracket \ZERO \rrbracket$ & $\dn$ & $1$ \\
		$\llbracket _{bs} r_1 \cdot r_2 \rrbracket$ & $\dn$ & $\llbracket r_1 \rrbracket + \llbracket r_2 \rrbracket + 1$\\
		$\llbracket _{bs}\mathbf{c} \rrbracket $ & $\dn$ & $1$\\
		$\llbracket _{bs}\sum as \rrbracket $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket)\; as   + 1$\\
		$\llbracket _{bs} a^* \rrbracket $ & $\dn$ & $\llbracket a \rrbracket + 1$.
	\end{tabular}
\end{center}
We believe this size formalisation 
of the algorithm is important in our context, because 
\begin{itemize}
	\item
		It is a stepping stone towards an ``absence of catastrophic-backtracking''
		guarantee. The next step would be to refine the bound $N_a$ so that it
		is polynomial on $\llbracket a\rrbracket$.
	\item
		The size bound proof gives us a higher confidence that
		our simplification algorithm $\simp$ does not ``mis-behave''
		like $\simpsulz$ does.
		The bound is universal, which is an advantage over work which 
		only gives empirical evidence on some test cases.
\end{itemize}
\section{Formalising About Size}
\noindent
In our lexer $\blexersimp$,
The regular expression is repeatedly being taken derivative of
and then simplified.
\begin{figure}[H]
	\begin{tikzpicture}[scale=2,
		every node/.style={minimum size=11mm},
		->,>=stealth',shorten >=1pt,auto,thick
		]
		\node (r0) [rectangle, draw=black, thick, minimum size = 5mm, draw=blue] {$a$};
		\node (r1) [rectangle, draw=black, thick, right=of r0, minimum size = 7mm]{$a_1$};
		\draw[->,line width=0.2mm](r0)--(r1) node[above,midway] {$\backslash c_1$};

		\node (r1s) [rectangle, draw=blue, thick, right=of r1, minimum size=6mm]{$a_{1s}$};
		\draw[->, line width=0.2mm](r1)--(r1s) node[above, midway] {$\simp$};

		\node (r2) [rectangle, draw=black, thick,  right=of r1s, minimum size = 12mm]{$a_2$};
		\draw[->,line width=0.2mm](r1s)--(r2) node[above,midway] {$\backslash c_2$};

		\node (r2s) [rectangle, draw = blue, thick, right=of r2,minimum size=6mm]{$a_{2s}$};
		\draw[->,line width=0.2mm](r2)--(r2s) node[above,midway] {$\simp$};

		\node (rns) [rectangle, draw = blue, thick, right=of r2s,minimum size=6mm]{$a_{ns}$};
		\draw[->,line width=0.2mm, dashed](r2s)--(rns) node[above,midway] {$\backslash \ldots$};

		\node (v) [circle, thick, draw, right=of rns, minimum size=6mm, right=1.7cm]{$v$};
		\draw[->, line width=0.2mm](rns)--(v) node[above, midway] {\bmkeps} node [below, midway] {\decode};
	\end{tikzpicture}
	\caption{Regular expression size change during our $\blexersimp$ algorithm}\label{simpShrinks}
\end{figure}
\noindent
Each time
a derivative is taken, a regular expression might grow a bit,
but simplification always takes care that 
it stays small.
This intuition is depicted by the relative size
change between the black and blue nodes:
After $\simp$ the node always shrinks.
Our proof says that all the blue nodes
stay below a size bound $N_a$ determined by $a$.

\noindent
Sulzmann and Lu's assumed something similar about their algorithm,
though in fact their algorithm's size might be better depicted by the following graph:
\begin{figure}[H]
	\begin{tikzpicture}[scale=2,
		every node/.style={minimum size=11mm},
		->,>=stealth',shorten >=1pt,auto,thick
		]
		\node (r0) [rectangle, draw=black, thick, minimum size = 5mm, draw=blue] {$a$};
		\node (r1) [rectangle, draw=black, thick, right=of r0, minimum size = 7mm]{$a_1$};
		\draw[->,line width=0.2mm](r0)--(r1) node[above,midway] {$\backslash c_1$};

		\node (r1s) [rectangle, draw=blue, thick, right=of r1, minimum size=7mm]{$a_{1s}$};
		\draw[->, line width=0.2mm](r1)--(r1s) node[above, midway] {$\simp'$};

		\node (r2) [rectangle, draw=black, thick,  right=of r1s, minimum size = 17mm]{$a_2$};
		\draw[->,line width=0.2mm](r1s)--(r2) node[above,midway] {$\backslash c_2$};

		\node (r2s) [rectangle, draw = blue, thick, right=of r2,minimum size=14mm]{$a_{2s}$};
		\draw[->,line width=0.2mm](r2)--(r2s) node[above,midway] {$\simp'$};

		\node (r3) [rectangle, draw = black, thick, right= of r2s, minimum size = 22mm]{$a_3$};
		\draw[->,line width=0.2mm](r2s)--(r3) node[above,midway] {$\backslash c_3$};

		\node (rns) [right = of r3, draw=blue, minimum size = 20mm]{$a_{3s}$};
		\draw[->,line width=0.2mm] (r3)--(rns) node [above, midway] {$\simp'$};

		\node (rnn) [right = of rns, minimum size = 1mm]{};
		\draw[->, dashed] (rns)--(rnn) node [above, midway] {$\ldots$};

	\end{tikzpicture}
	\caption{Regular expression size change during our $\blexersimp$ algorithm}\label{sulzShrinks}
\end{figure}
\noindent
That is, on certain cases their lexer 
will have an indefinite size explosion, causing the running time 
of each derivative step to grow arbitrarily large (for example 
in \ref{SulzmannLuLexerTime}).
The reason they made this mistake was that
they tested out the run time of their
lexer on particular examples such as $(a+b+ab)^*$
and generalised to all cases, which
cannot happen with our mecahnised proof.\\
We give details of the proof in the next sections.

\subsection{Overview of the Proof}
Here is a bird's eye view of how the proof of finiteness works,
which involves three steps:
\begin{figure}[H]
	\begin{tikzpicture}[scale=1,font=\bf,
		node/.style={
			rectangle,rounded corners=3mm,
			ultra thick,draw=black!50,minimum height=18mm, 
			minimum width=20mm,
		top color=white,bottom color=black!20}]


		\node (0) at (-5,0) 
			[node, text width=1.8cm, text centered] 
			{$\llbracket \bderssimp{a}{s} \rrbracket$};
		\node (A) at (0,0) 
			[node,text width=1.6cm,  text centered] 
			{$\llbracket \rderssimp{r}{s} \rrbracket_r$};
		\node (B) at (3,0) 
			[node,text width=3.0cm, anchor=west, minimum width = 40mm] 
			{$\llbracket \textit{ClosedForm}(r, s)\rrbracket_r$};
		\node (C) at (9.5,0) [node, minimum width=10mm] {$N_r$};

		\draw [->,line width=0.5mm] (0) -- 
			node [above,pos=0.45] {=} (A) node [below, pos = 0.45] {$(r = a \downarrow_r)$} (A); 
		\draw [->,line width=0.5mm] (A) -- 
			node [above,pos=0.35] {$\quad =\ldots=$} (B); 
		\draw [->,line width=0.5mm] (B) -- 
			node [above,pos=0.35] {$\quad \leq \ldots \leq$} (C); 
	\end{tikzpicture}
	%\caption{
\end{figure}
\noindent
We explain the steps one by one:
\begin{itemize}
	\item
		We first introduce the operations such as 
		derivatives, simplification, size calculation, etc.
		associated with $\rrexp$s, which we have given
		a very brief introduction to in chapter \ref{Bitcoded2}.
		The operations on $\rrexp$s are identical to those on
		annotated regular expressions except that they are unaware
		of bitcodes. This means that all proofs about size of $\rrexp$s will apply to
		annotated regular expressions.
	\item
		We prove that $\rderssimp{r}{s} = \textit{ClosedForm}(r, s)$,
		where $\textit{ClosedForm}(r, s)$ is entirely 
		written in the derivatives of their children regular 
		expressions.
		We call the right-hand-side the \emph{Closed Form}
		of the derivative $\rderssimp{r}{s}$.
	\item
		We estimate $\llbracket \textit{ClosedForm}(r, s) \rrbracket_r$.
		The key observation is that $\distinctBy$'s output is
		a list with a constant length bound.
\end{itemize}
We will expand on these steps in the next sections.\\

\section{The $\textit{Rrexp}$ Datatype and Its Lexing-Related Functions}
The first step is to define 
$\textit{rrexp}$s.
They are without bitcodes,
allowing a much simpler size bound proof.
Of course, the bits which encode the lexing information 
would grow linearly with respect 
to the input, which should be taken into account when we wish to tackle the runtime comlexity.
But for the sake of the structural size 
we can safely ignore them.\\
To recapitulate, the datatype 
definition of the $\rrexp$, called
\emph{r-regular expressions},
was initially defined in \ref{rrexpDef}.
The reason for the prefix $r$ is
to make a distinction  
with basic regular expressions.
\[			\rrexp ::=   \RZERO \mid  \RONE
	\mid  \RCHAR{c}  
	\mid  \RSEQ{r_1}{r_2}
	\mid  \RALTS{rs}
	\mid \RSTAR{r}        
\]
The size of an r-regular expression is
written $\llbracket r\rrbracket_r$, 
whose definition mirrors that of an annotated regular expression. 
\begin{center}
	\begin{tabular}{ccc}
		$\llbracket _{bs}\ONE \rrbracket_r$ & $\dn$ & $1$\\
		$\llbracket \ZERO \rrbracket_r$ & $\dn$ & $1$ \\
		$\llbracket _{bs} r_1 \cdot r_2 \rrbracket_r$ & $\dn$ & $\llbracket r_1 \rrbracket_r + \llbracket r_2 \rrbracket_r + 1$\\
		$\llbracket _{bs}\mathbf{c} \rrbracket_r $ & $\dn$ & $1$\\
		$\llbracket _{bs}\sum as \rrbracket_r $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket_r)\; as   + 1$\\
		$\llbracket _{bs} a^* \rrbracket_r $ & $\dn$ & $\llbracket a \rrbracket_r + 1$.
	\end{tabular}
\end{center}
\noindent
The $r$ in the subscript of $\llbracket \rrbracket_r$ is to 
differentiate with the same operation for annotated regular expressions.
Adding $r$ as subscript will be used in 
other operations as well.\\
The transformation from an annotated regular expression
to an r-regular expression is straightforward.
\begin{center}
	\begin{tabular}{lcl}
		$\rerase{\ZERO}$ & $\dn$ & $\RZERO$\\
		$\rerase{_{bs}\ONE}$ & $\dn$ & $\RONE$\\
		$\rerase{_{bs}\mathbf{c}}$ & $\dn$ & $\RCHAR{c}$\\
		$\rerase{_{bs}r_1\cdot r_2}$ & $\dn$ & $\RSEQ{\rerase{r_1}}{\rerase{r_2}}$\\
		$\rerase{_{bs}\sum as}$ & $\dn$ & $\RALTS{\map \; \rerase{\_} \; as}$\\
		$\rerase{_{bs} a ^*}$ & $\dn$ & $\rerase{a} ^*$
	\end{tabular}
\end{center}
\noindent
$\textit{Rerase}$ throws away the bitcodes 
on the annotated regular expressions, 
but keeps everything else intact.
Therefore it does not change the size
of an annotated regular expression:
\begin{lemma}\label{rsizeAsize}
	$\rsize{\rerase a} = \asize a$
\end{lemma}
\begin{proof}
	By routine structural induction on $a$.
\end{proof}
\noindent

\subsection{Motivation Behind a New Datatype}
The reason we take all the trouble 
defining a new datatype is that $\erase$ makes things harder.
We initially started by using 
plain regular expressions and tried to prove
the lemma \ref{rsizeAsize},
however the $\erase$ function unavoidbly messes with the structure of the 
annotated regular expression.
The $+$ constructor
of basic regular expressions is binary whereas $\sum$ 
takes a list, and one has to convert between them:
\begin{center}
	\begin{tabular}{ccc}
		$\erase \; _{bs}\sum [] $ & $\dn$ & $\ZERO$\\
		$\erase \; _{bs}\sum [a]$ & $\dn$ & $a$\\
		$\erase \; _{bs}\sum a :: as$ & $\dn$ & $a + (\erase \; _{[]} \sum as)\quad \text{if $as$ length over 1}$
	\end{tabular}
\end{center}
\noindent
An alternative regular expression with an empty argument list
will be turned into a $\ZERO$.
The singleton alternative $\sum [r]$ would have $r$ during the
$\erase$ function.
The  annotated regular expression $\sum[a, b, c]$ would turn into
$(a+(b+c))$.
All these operations change the size and structure of
an annotated regular expressions, adding unnecessary 
complexities to the size bound proof.\\
For example, if we define the size of a basic regular expression 
in the usual way,
\begin{center}
	\begin{tabular}{ccc}
		$\llbracket \ONE \rrbracket_p$ & $\dn$ & $1$\\
		$\llbracket \ZERO \rrbracket_p$ & $\dn$ & $1$ \\
		$\llbracket r_1 \cdot r_2 \rrbracket_p$ & $\dn$ & $\llbracket r_1 \rrbracket_p + \llbracket r_2 \rrbracket_p + 1$\\
		$\llbracket \mathbf{c} \rrbracket_p $ & $\dn$ & $1$\\
		$\llbracket r_1 \cdot r_2 \rrbracket_p $ & $\dn$ & $\llbracket r_1 \rrbracket_p \; + \llbracket r_2 \rrbracket_p + 1$\\
		$\llbracket a^* \rrbracket_p $ & $\dn$ & $\llbracket a \rrbracket_p + 1$
	\end{tabular}
\end{center}
\noindent
Then the property
\begin{center}
	$\llbracket a \rrbracket = \llbracket a_\downarrow \rrbracket_p$
\end{center}
does not hold.
One might be able to prove an inequality such as
$\llbracket a \rrbracket  \leq \llbracket  a_\downarrow \rrbracket_p $
and then estimate $\llbracket  a_\downarrow \rrbracket_p$,
but we found our approach more straightforward.\\

\subsection{Lexing Related Functions for $\rrexp$ such as $\backslash_r$, $\rdistincts$, and $\rsimp$}
The operations on r-regular expressions are 
almost identical to those of the annotated regular expressions,
except that no bitcodes are used. For example,
the derivative operation becomes simpler:\\
\begin{center}
	\begin{tabular}{@{}lcl@{}}
		$(\ZERO)\,\backslash_r c$ & $\dn$ & $\ZERO$\\  
		$(\ONE)\,\backslash_r c$ & $\dn$ &
		$\textit{if}\;c=d\; \;\textit{then}\;
		\ONE\;\textit{else}\;\ZERO$\\  
		$(\sum \;\textit{rs})\,\backslash_r c$ & $\dn$ &
		$\sum\;(\textit{map} \; (\_\backslash_r c) \; rs )$\\
		$(r_1\cdot r_2)\,\backslash_r c$ & $\dn$ &
		$\textit{if}\;(\textit{rnullable}\,r_1)$\\
						 & &$\textit{then}\;\sum\,[(r_1\,\backslash_r c)\cdot\,r_2,$\\
						 & &$\phantom{\textit{then},\;\sum\,}((r_2\,\backslash_r c))]$\\
						 & &$\textit{else}\;\,(r_1\,\backslash_r c)\cdot r_2$\\
		$(r^*)\,\backslash_r c$ & $\dn$ &
		$( r\,\backslash_r c)\cdot
		(_{[]}r^*))$
	\end{tabular}    
\end{center}  
\noindent
Similarly, $\distinctBy$ does not need 
a function checking equivalence because
there are no bit annotations causing superficial differences
between syntactically equal terms.
\begin{center}
	\begin{tabular}{lcl}
		$\rdistinct{[]}{rset} $ & $\dn$ & $[]$\\
		$\rdistinct{r :: rs}{rset}$ & $\dn$ & 
		$\textit{if}(r \in \textit{rset}) \; \textit{then} \; \rdistinct{rs}{rset}$\\
					    &        & $\textit{else}\; \;
					    r::\rdistinct{rs}{(rset \cup \{r\})}$
	\end{tabular}
\end{center}
%TODO: definition of rsimp (maybe only the alternative clause)
\noindent
We would like to make clear
a difference between our $\rdistincts$ and
the Isabelle $\textit {distinct}$ predicate.
In Isabelle $\textit{distinct}$ is a function that returns a boolean
rather than a list.
It tests if all the elements of a list are unique.\\
With $\textit{rdistinct}$,
and the flatten function for $\rrexp$s:
 \begin{center}
  \begin{tabular}{@{}lcl@{}}
  $\textit{rflts} \; (\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $as \; @ \; \textit{rflts} \; as' $ \\
  $\textit{rflts} \; \ZERO :: as'$ & $\dn$ & $ \textit{rflts} \;  \textit{as'} $ \\
    $\textit{rflts} \; a :: as'$ & $\dn$ & $a :: \textit{rflts} \; \textit{as'}$ \quad(otherwise) 
\end{tabular}    
\end{center}  
\noindent
one can chain together all the other modules
such as $\rsimpalts$:
\begin{center}
  \begin{tabular}{@{}lcl@{}}
	  $\rsimpalts \;\; nil$ & $\dn$ & $\RZERO$\\
	  $\rsimpalts \;\; r::nil$ & $\dn$ & $r$\\
	  $\rsimpalts \;\; rs$ & $\dn$ & $\sum rs$\\
\end{tabular}    
\end{center}  
\noindent
and $\rsimpseq$:
\begin{center}
  \begin{tabular}{@{}lcl@{}}
	  $\rsimpseq \;\; \RZERO \; \_ $ &   $=$ &   $\RZERO$\\
	  $\rsimpseq \;\; \_ \; \RZERO $ &   $=$ &   $\RZERO$\\
	  $\rsimpseq \;\; \RONE \cdot r_2$ & $\dn$ & $r_2$\\
	  $\rsimpseq \;\; r_1 r_2$ & $\dn$ & $r_1 \cdot r_2$\\
\end{tabular}    
\end{center}  
and get $\textit{rsimp}$ and $\rderssimp{\_}{\_}$:
\begin{center}
  \begin{tabular}{@{}lcl@{}}
   
	  $\textit{rsimp} \; (r_1\cdot r_2)$ & $\dn$ & $ \textit{rsimp}_{SEQ} \; bs \;(\textit{rsimp} \; r_1) \; (\textit{rsimp}  \; r_2)  $ \\
	  $\textit{rsimp} \; (_{bs}\sum \textit{rs})$ & $\dn$ & $\textit{rsimp}_{ALTS} \; \textit{bs} \; (\textit{rdistinct} \; ( \textit{rflts} ( \textit{map} \; rsimp \; rs)) \; \rerases \; \varnothing) $ \\
   $\textit{rsimp} \; r$ & $\dn$ & $\textit{r} \qquad \textit{otherwise}$   
\end{tabular}    
\end{center} 
\begin{center}
	\begin{tabular}{@{}lcl@{}}
		$r\backslash_{rsimp} \, c$ & $\dn$ & $\rsimp \; (r\backslash_r \, c)$
	\end{tabular}
\end{center}

\begin{center}
	\begin{tabular}{@{}lcl@{}}
$r \backslash_{rsimps} \; \; c\!::\!s $ & $\dn$ & $(r \backslash_{rsimp}\, c) \backslash_{rsimps}\, s$ \\
$r \backslash_{rsimps} [\,] $ & $\dn$ & $r$
	\end{tabular}
\end{center}
\noindent
We do not define an r-regular expression version of $\blexersimp$,
as our proof does not involve its use 
(and there is no bitcode to decode into a lexical value). 
Everything about the size of annotated regular expressions
can be calculated via the size of r-regular expressions:
\begin{lemma}\label{sizeRelations}
	The following equalities hold:
	\begin{itemize}
		\item
			$\asize{\bsimps \; a} = \rsize{\rsimp{ \rerase{a}}}$
		\item
			$\asize{\bderssimp{a}{s}} =  \rsize{\rderssimp{\rerase{a}}{s}}$
	\end{itemize}
\end{lemma}
\begin{proof}
	The first part is by induction on the inductive cases
	of $\textit{bsimp}$.
	The second part is by induction on the string $s$,
	where the inductive step follows from part one.
\end{proof}
\noindent
With lemma \ref{sizeRelations},
we will be able to focus on 
estimating only
$\rsize{\rderssimp{\rerase{a}}{s}}$
in later parts because
\begin{center}
	$\rsize{\rderssimp{\rerase{a}}{s}} \leq N_r \quad$
	implies
	$\quad \llbracket a \backslash_{bsimps} s \rrbracket \leq N_r$.
\end{center}
Unless stated otherwise in the rest of this 
chapter all regular expressions without
bitcodes are seen as r-regular expressions ($\rrexp$s).
For the binary alternative r-regular expression $\RALTS{[r_1, r_2]}$,
we use the notation $r_1 + r_2$
for brevity.


%-----------------------------------
%	SUB SECTION ROADMAP RREXP BOUND
%-----------------------------------

%\subsection{Roadmap to a Bound for $\textit{Rrexp}$}

%The way we obtain the bound for $\rrexp$s is by two steps:
%\begin{itemize}
%	\item
%		First, we rewrite $r\backslash s$ into something else that is easier
%		to bound. This step is especially important for the inductive case 
%		$r_1 \cdot r_2$ and $r^*$, where the derivative can grow and bloat in a wild way,
%		but after simplification they will always be equal or smaller to a form consisting of an alternative
%		list of regular expressions $f \; (g\; (\sum rs))$ with some functions applied to it, where each element will be distinct after the function application.
%	\item
%		Then, for such a sum  list of regular expressions $f\; (g\; (\sum rs))$, we can control its size
%		by estimation, since $\distinctBy$ and $\flts$ are well-behaved and working together would only 
%		reduce the size of a regular expression, not adding to it.
%\end{itemize}
%
%\section{Step One: Closed Forms}
%We transform the function application $\rderssimp{r}{s}$
%into an equivalent 
%form $f\; (g \; (\sum rs))$.
%The functions $f$ and $g$ can be anything from $\flts$, $\distinctBy$ and other helper functions from $\bsimp{\_}$.
%This way we get a different but equivalent way of expressing : $r\backslash s = f \; (g\; (\sum rs))$, we call the
%right hand side the "closed form" of $r\backslash s$.
%
%\begin{quote}\it
%	Claim: For regular expressions $r_1 \cdot r_2$, we claim that
%	\begin{center}
%		$ \rderssimp{r_1 \cdot r_2}{s} = 
%		\rsimp{(\sum (r_1 \backslash s \cdot r_2 ) \; :: \;(\map \; \rderssimp{r2}{\_} \;(\vsuf{s}{r_1})))}$
%	\end{center}
%\end{quote}
%\noindent
%We explain in detail how we reached those claims.
\subsection{The Idea Behind Closed Forms}
If we attempt to prove 
\begin{center}
	$\forall r. \; \exists N_r.\;\; s.t. \llbracket r\backslash_{rsimps} s \rrbracket_r \leq N_r$
\end{center}
using a naive induction on the structure of $r$,
then we are stuck at the inductive cases such as
$r_1\cdot r_2$.
The inductive hypotheses are:
\begin{center}
	1: $\text{for } r_1, \text{there exists } N_{r_1}.\;\; s.t. 
	\;\;\forall s.  \llbracket r_1 \backslash_{rsimps} s \rrbracket_r \leq N_{r_1}. $\\
	2: $\text{for } r_2, \text{there exists } N_{r_2}.\;\; s.t. 
	\;\; \forall s. \llbracket r_2 \backslash_{rsimps} s \rrbracket_r \leq N_{r_2}. $
\end{center}
The inductive step to prove would be 
\begin{center}
	$\text{there exists } N_{r_1\cdot r_2}. \;\; s.t. \forall s. 
	\llbracket (r_1 \cdot r_2) \backslash_{rsimps} s \rrbracket_r \leq N_{r_1\cdot r_2}.$
\end{center}
The problem is that it is not clear what 
$(r_1\cdot r_2) \backslash_{rsimps} s$ looks like,
and therefore $N_{r_1}$ and $N_{r_2}$ in the
inductive hypotheses cannot be directly used.
We have already seen that $(r_1 \cdot r_2)\backslash s$ 
and $(r^*)\backslash s$ can grow in a wild way.
The point is that they will be equivalent to a list of
terms $\sum rs$, where each term in $rs$ will
be made of $r_1 \backslash s' $, $r_2\backslash s'$,
and $r \backslash s'$ with $s' \in \textit{SubString} \; s$.
The list $\sum rs$ will then be de-duplicated by $\textit{rdistinct}$
in the simplification which saves $rs$ from growing indefinitely.

Based on this idea, we sketch a proof by first showing the equality (where
$f$ and $g$ are functions that do not increase the size of the input)
\begin{center}
$(r_1 \cdot r_2)\backslash_{rsimps} s = f\; (\textit{rdistinct} \; (g\; \sum rs))$,
\end{center}
and then show the right-hand-side can be finitely bounded.
We call the right-hand-side the 
\emph{Closed Form} of $(r_1 \cdot r_2)\backslash_{rsimps} s$.
We will flesh out the proof sketch in the next section.

\section{Details of Closed Forms and Bounds}
In this section we introduce in detail
how the closed forms are obtained for regular expressions'
derivatives and how they are bounded.
We start by proving some basic identities
involving the simplification functions for r-regular expressions.
After that we use these identities to establish the
closed forms we need.
Finally, we prove the functions such as $\flts$
will keep the size non-increasing.
Putting this together with a general bound 
on the finiteness of distinct regular expressions
smaller than a certain size, we obtain a bound on 
the closed forms.
%$r_1\cdot r_2$, $r^*$ and $\sum rs$.



\subsection{Some Basic Identities}


\subsubsection{$\textit{rdistinct}$'s Deduplicates Successfully}
The $\textit{rdistinct}$ function, as its name suggests, will
remove duplicates in an r-regular expression list.
It will also correctly exclude any elements that 
is intially in the accumulator set.
\begin{lemma}\label{rdistinctDoesTheJob}
	%The function $\textit{rdistinct}$ satisfies the following
	%properties:
	Assume we have the predicate $\textit{isDistinct}$\footnote{We omit its
	recursive definition here, its Isabelle counterpart would be $\textit{distinct}$.} 
	readily defined
	for testing
	whether a list's elements are all unique. Then the following
	properties about $\textit{rdistinct}$ hold:
	\begin{itemize}
		\item
			If $a \in acc$ then $a \notin (\rdistinct{rs}{acc})$.
		\item
			%If list $rs'$ is the result of $\rdistinct{rs}{acc}$,
			$\textit{isDistinct} \;\;\; (\rdistinct{rs}{acc})$.
		\item
			$\textit{set} \; (\rdistinct{rs}{acc}) 
			= (textit{set} \; rs) - acc$
	\end{itemize}
\end{lemma}
\noindent
\begin{proof}
	The first part is by an induction on $rs$.
	The second and third part can be proven by using the 
	inductive cases of $\textit{rdistinct}$.

\end{proof}

\noindent
$\rdistinct{\_}{\_}$ will cancel out all regular expression terms
that are in the accumulator, therefore prepending a list $rs_a$ with an arbitrary
list $rs$ whose elements are all from the accumulator, and then call $\rdistinct{\_}{\_}$
on the resulting list, the output will be as if we had called $\rdistinct{\_}{\_}$
without the prepending of $rs$:
\begin{lemma}\label{rdistinctConcat}
	The elements appearing in the accumulator will always be removed.
	More precisely,
	\begin{itemize}
		\item
			If $rs \subseteq rset$, then 
			$\rdistinct{rs@rsa }{acc} = \rdistinct{rsa }{acc}$.
		\item
			More generally, if $a \in rset$ and $\rdistinct{rs}{\{a\}} = []$,
			then $\rdistinct{(rs @ rs')}{rset} = \rdistinct{rs'}{rset}$
	\end{itemize}
\end{lemma}

\begin{proof}
	By induction on $rs$ and using \ref{rdistinctDoesTheJob}.
\end{proof}
\noindent
On the other hand, if an element $r$ does not appear in the input list waiting to be deduplicated,
then expanding the accumulator to include that element will not cause the output list to change:
\begin{lemma}
	The accumulator can be augmented to include elements not appearing in the input list,
	and the output will not change.	
	\begin{itemize}
		\item
			If $r \notin rs$, then $\rdistinct{rs}{acc} = \rdistinct{rs}{\{r\} \cup acc}$.
		\item
			Particularly, when $acc = \varnothing$ and $rs$ de-duplicated, we have\\
			\[ \rdistinct{rs}{\varnothing} = rs \]
	\end{itemize}
\end{lemma}
\begin{proof}
	The first half is by induction on $rs$. The second half is a corollary of the first.
\end{proof}
\noindent
The next property gives the condition for
when $\rdistinct{\_}{\_}$ becomes an identical mapping
for any prefix of an input list, in other words, when can 
we ``push out" the arguments of $\rdistinct{\_}{\_}$:
\begin{lemma}\label{distinctRdistinctAppend}
	If $\textit{isDistinct} \; rs_1$, and $rs_1 \cap acc = \varnothing$,
	then 
	\[\textit{rdistinct}\;  (rs_1 @ rsa)\;\, acc
	= rs_1@(\textit{rdistinct} rsa \; (acc \cup rs_1))\]
\end{lemma}
\noindent
In other words, it can be taken out and left untouched in the output.
\begin{proof}
	By an induction on $rs_1$, where $rsa$ and $acc$ are allowed to be arbitrary.
\end{proof}
\noindent
$\rdistinct{}{}$ removes any element in anywhere of a list, if it
had appeared previously:
\begin{lemma}\label{distinctRemovesMiddle}
	The two properties hold if $r \in rs$:
	\begin{itemize}
		\item
			$\rdistinct{rs}{rset} = \rdistinct{(rs @ [r])}{rset}$\\
			and\\
			$\rdistinct{(ab :: rs @ [ab])}{rset'} = \rdistinct{(ab :: rs)}{rset'}$
		\item
			$\rdistinct{ (rs @ rs') }{rset} = \rdistinct{rs @ [r] @ rs'}{rset}$\\
			and\\
			$\rdistinct{(ab :: rs @ [ab] @ rs'')}{rset'} = 
			\rdistinct{(ab :: rs @ rs'')}{rset'}$
	\end{itemize}
\end{lemma}
\noindent
\begin{proof}
	By induction on $rs$. All other variables are allowed to be arbitrary.
	The second half of the lemma requires the first half.
	Note that for each half's two sub-propositions need to be proven concurrently,
	so that the induction goes through.
\end{proof}

\noindent
This allows us to prove ``Idempotency" of $\rdistinct{}{}$ of some kind:
\begin{lemma}\label{rdistinctConcatGeneral}
	The following equalities involving multiple applications  of $\rdistinct{}{}$ hold:
	\begin{itemize}
		\item
			$\rdistinct{(rs @ rs')}{\varnothing} = \rdistinct{((\rdistinct{rs}{\varnothing})@ rs')}{\varnothing}$
		\item
			$\rdistinct{(rs @ rs')}{\varnothing} = \rdistinct{(\rdistinct{rs}{\varnothing} @ rs')}{\varnothing}$
		\item
			If $rset' \subseteq rset$, then $\rdistinct{rs}{rset} = 
			\rdistinct{(\rdistinct{rs}{rset'})}{rset}$. As a corollary
			of this,
		\item
			$\rdistinct{(rs @ rs')}{rset} = \rdistinct{
			(\rdistinct{rs}{\varnothing}) @ rs')}{rset}$. This
			gives another corollary use later:
		\item
			If $a \in rset$, then $\rdistinct{(rs @ rs')}{rset} = \rdistinct{
			(\rdistinct{(a :: rs)}{\varnothing} @ rs')}{rset} $,

	\end{itemize}
\end{lemma}
\begin{proof}
	By \ref{rdistinctDoesTheJob} and \ref{distinctRemovesMiddle}.
\end{proof}

\subsubsection{The Properties of $\backslash_r$, $\backslash_{rsimp}$, $\textit{Rflts}$ and $\textit{Rsimp}_{ALTS}$} 
We give in this subsection some properties of how $\backslash_r$, $\backslash_{rsimp}$, $\textit{Rflts}$ and $\textit{Rsimp}_{ALTS} $ interact with each other and with $@$, the concatenation operator.
These will be helpful in later closed form proofs, when
we want to transform the ways in which multiple functions involving
those are composed together
in interleaving derivative and  simplification steps.

When the function $\textit{Rflts}$ 
is applied to the concatenation of two lists, the output can be calculated by first applying the
functions on two lists separately, and then concatenating them together.
\begin{lemma}\label{rfltsProps}
	The function $\rflts$ has the below properties:\\
	\begin{itemize}
		\item
			$\rflts \; (rs_1 @ rs_2) = \rflts \; rs_1 @ \rflts \; rs_2$
		\item
			If $r \neq \RZERO$ and $\nexists rs_1. r = \RALTS{rs}_1$, then $\rflts \; (r::rs) = r :: \rflts \; rs$
		\item
			$\rflts \; (rs @ [\RZERO]) = \rflts \; rs$
		\item
			$\rflts \; (rs' @ [\RALTS{rs}]) = \rflts \; rs'@rs$
		\item
			$\rflts \; (rs @ [\RONE]) = \rflts \; rs @ [\RONE]$
		\item
			If $r \neq \RZERO$ and $\nexists rs'. r = \RALTS{rs'}$ then $\rflts \; (rs @ [r])
			= (\rflts \; rs) @ [r]$
		\item
			If $r = \RALTS{rs}$ and $r \in rs'$ then for all $r_1 \in rs. 
			r_1 \in \rflts \; rs'$.
		\item
			$\rflts \; (rs_a @ \RZERO :: rs_b) = \rflts \; (rs_a @ rs_b)$
	\end{itemize}
\end{lemma}
\noindent
\begin{proof}
	By induction on $rs_1$ in the first sub-lemma, and induction on $r$ in the second part,
	and induction on $rs$, $rs'$, $rs$, $rs'$, $rs_a$ in the third, fourth, fifth, sixth and 
	last sub-lemma.
\end{proof}

\subsubsection{The $RL$ Function: Language Interpretation of $\textit{Rrexp}$s}
Much like the definition of $L$ on plain regular expressions, one could also 
define the language interpretation of $\rrexp$s.
\begin{center}
	\begin{tabular}{lcl}
		$RL \; (\ZERO)$ & $\dn$ & $\phi$\\
		$RL \; (\ONE)$ & $\dn$ & $\{[]\}$\\
		$RL \; (c)$ & $\dn$ & $\{[c]\}$\\
		$RL \; \sum rs$ & $\dn$ & $ \bigcup_{r \in rs} (RL \; r)$\\
		$RL \; (r_1 \cdot r_2)$ & $\dn$ & $ RL \; (r_1) @ RL \; (r_2)$\\
		$RL \; (r^*)$ & $\dn$ & $ (RL(r))^*$
	\end{tabular}
\end{center}
\noindent
The main use of $RL$ is to establish some connections between $\rsimp{}$ 
and $\rnullable{}$:
\begin{lemma}
	The following properties hold:
	\begin{itemize}
		\item
			If $\rnullable{r}$, then $\rsimp{r} \neq \RZERO$.
		\item
			$\rnullable{r \backslash s} \quad $ if and only if $\quad \rnullable{\rderssimp{r}{s}}$.
	\end{itemize}
\end{lemma}
\begin{proof}
	The first part is by induction on $r$. 
	The second part is true because property 
	\[ RL \; r = RL \; (\rsimp{r})\] holds.
\end{proof}

\subsubsection{$\rsimp{}$ is Non-Increasing}
In this subsection, we prove that the function $\rsimp{}$ does not 
make the $\llbracket \rrbracket_r$ size increase.


\begin{lemma}\label{rsimpSize}
	$\llbracket \rsimp{r} \rrbracket_r \leq \llbracket r \rrbracket_r$
\end{lemma}
\subsubsection{Simplified $\textit{Rrexp}$s are Good}
We formalise the notion of ``good" regular expressions,
which means regular expressions that
are not fully simplified. For alternative regular expressions that means they
do not contain any nested alternatives like 
\[ r_1 + (r_2 + r_3) \], un-removed $\RZERO$s like \[\RZERO + r\]
or duplicate elements in a children regular expression list like \[ \sum [r, r, \ldots]\]:
\begin{center}
	\begin{tabular}{@{}lcl@{}}
		$\good\; \RZERO$ & $\dn$ & $\textit{false}$\\
		$\good\; \RONE$ & $\dn$ & $\textit{true}$\\
		$\good\; \RCHAR{c}$ & $\dn$ & $\btrue$\\
		$\good\; \RALTS{[]}$ & $\dn$ & $\bfalse$\\
		$\good\; \RALTS{[r]}$ & $\dn$ & $\bfalse$\\
		$\good\; \RALTS{r_1 :: r_2 :: rs}$ & $\dn$ & 
		$\textit{isDistinct} \; (r_1 :: r_2 :: rs) \;$\\
						   & & $\textit{and}\; (\forall r' \in (r_1 :: r_2 :: rs).\; \good \; r'\; \,  \textit{and}\; \, \textit{nonAlt}\; r')$\\
		$\good \; \RSEQ{\RZERO}{r}$ & $\dn$ & $\bfalse$\\
		$\good \; \RSEQ{\RONE}{r}$ & $\dn$ & $\bfalse$\\
		$\good \; \RSEQ{r}{\RZERO}$ & $\dn$ & $\bfalse$\\
		$\good \; \RSEQ{r_1}{r_2}$ & $\dn$ & $\good \; r_1 \;\, \textit{and} \;\, \good \; r_2$\\
		$\good \; \RSTAR{r}$ & $\dn$ & $\btrue$\\
	\end{tabular}
\end{center}
\noindent
The predicate $\textit{nonAlt}$ evaluates to true when the regular expression is not an
alternative, and false otherwise.
The $\good$ property is preserved under $\rsimp_{ALTS}$, provided that
its non-empty argument list of expressions are all good themsleves, and $\textit{nonAlt}$, 
and unique:
\begin{lemma}\label{rsimpaltsGood}
	If $rs \neq []$ and forall $r \in rs. \textit{nonAlt} \; r$ and $\textit{isDistinct} \; rs$,
	then $\good \; (\rsimpalts \; rs)$ if and only if forall $r \in rs. \; \good \; r$.
\end{lemma}
\noindent
We also note that
if a regular expression $r$ is good, then $\rflts$ on the singleton
list $[r]$ will not break goodness:
\begin{lemma}\label{flts2}
	If $\good \; r$, then forall $r' \in \rflts \; [r]. \; \good \; r'$ and $\textit{nonAlt} \; r'$.
\end{lemma}
\begin{proof}
	By an induction on $r$.
\end{proof}
\noindent
The other observation we make about $\rsimp{r}$ is that it never
comes with nested alternatives, which we describe as the $\nonnested$
property:
\begin{center}
	\begin{tabular}{lcl}
		$\nonnested \; \, \sum []$ & $\dn$ & $\btrue$\\
		$\nonnested \; \, \sum ((\sum rs_1) :: rs_2)$ & $\dn$ & $\bfalse$\\
		$\nonnested \; \, \sum (r :: rs)$ & $\dn$ & $\nonnested (\sum rs)$\\
		$\nonnested \; \, r $ & $\dn$ & $\btrue$
	\end{tabular}
\end{center}
\noindent
The $\rflts$ function
always opens up nested alternatives,
which enables $\rsimp$ to be non-nested:

\begin{lemma}\label{nonnestedRsimp}
	$\nonnested \; (\rsimp{r})$
\end{lemma}
\begin{proof}
	By an induction on $r$.
\end{proof}
\noindent
With this we could prove that a regular expressions
after simplification and flattening and de-duplication,
will not contain any alternative regular expression directly:
\begin{lemma}\label{nonaltFltsRd}
	If $x \in \rdistinct{\rflts\; (\map \; \rsimp{} \; rs)}{\varnothing}$ 
	then $\textit{nonAlt} \; x$.
\end{lemma}
\begin{proof}
	By \ref{nonnestedRsimp}.
\end{proof}
\noindent
The other thing we know is that once $\rsimp{}$ had finished
processing an alternative regular expression, it will not
contain any $\RZERO$s, this is because all the recursive 
calls to the simplification on the children regular expressions
make the children good, and $\rflts$ will not take out
any $\RZERO$s out of a good regular expression list,
and $\rdistinct{}$ will not mess with the result.
\begin{lemma}\label{flts3Obv}
	The following are true:
	\begin{itemize}
		\item
			If for all $r \in rs. \, \good \; r $ or $r = \RZERO$,
			then for all $r \in \rflts\; rs. \, \good \; r$.
		\item
			If $x \in \rdistinct{\rflts\; (\map \; rsimp{}\; rs)}{\varnothing}$
			and for all $y$ such that $\llbracket y \rrbracket_r$ less than
			$\llbracket rs \rrbracket_r + 1$, either
			$\good \; (\rsimp{y})$ or $\rsimp{y} = \RZERO$,
			then $\good \; x$.
	\end{itemize}
\end{lemma}
\begin{proof}
	The first part is by induction on $rs$, where the induction
	rule is the inductive cases for $\rflts$.
	The second part is a corollary from the first part.
\end{proof}

And this leads to good structural property of $\rsimp{}$,
that after simplification, a regular expression is
either good or $\RZERO$:
\begin{lemma}\label{good1}
	For any r-regular expression $r$, $\good \; \rsimp{r}$ or $\rsimp{r} = \RZERO$.
\end{lemma}
\begin{proof}
	By an induction on $r$. The inductive measure is the size $\llbracket \rrbracket_r$.
	Lemma \ref{rsimpSize} says that 
	$\llbracket \rsimp{r}\rrbracket_r$ is smaller than or equal to
	$\llbracket r \rrbracket_r$.
	Therefore, in the $r_1 \cdot r_2$ and $\sum rs$ case,
	Inductive hypothesis applies to the children regular expressions
	$r_1$, $r_2$, etc. The lemma \ref{flts3Obv}'s precondition is satisfied
	by that as well.
	The lemmas \ref{nonnestedRsimp} and  \ref{nonaltFltsRd} are used
	to ensure that goodness is preserved at the topmost level.
\end{proof}
We shall prove that any good regular expression is 
a fixed-point for $\rsimp{}$.
First we prove an auxiliary lemma:
\begin{lemma}\label{goodaltsNonalt}
	If $\good \; \sum rs$, then $\rflts\; rs = rs$.
\end{lemma}
\begin{proof}
	By an induction on $\sum rs$. The inductive rules are the cases
	for $\good$.
\end{proof}
\noindent
Now we are ready to prove that good regular expressions are invariant
of $\rsimp{}$ application:
\begin{lemma}\label{test}
	If $\good \;r$ then $\rsimp{r} = r$.
\end{lemma}
\begin{proof}
	By an induction on the inductive cases of $\good$.
	The lemma \ref{goodaltsNonalt} is used in the alternative
	case where 2 or more elements are present in the list.
\end{proof}
\noindent
Given below is a property involving $\rflts$, $\rdistinct{}{}$, $\rsimp{}$ and $\rsimp_{ALTS}$,
which requires $\ref{good1}$ to go through smoothly.
It says that an application of $\rsimp_{ALTS}$ can be "absorbed",
if it its output is concatenated with a list and then applied to $\rflts$.
\begin{lemma}\label{flattenRsimpalts}
	$\rflts \; ( (\rsimp_{ALTS} \; 
	(\rdistinct{(\rflts \; (\map \; \rsimp{}\; rs))}{\varnothing})) :: 
	\map \; \rsimp{} \; rs' ) = 
	\rflts \; ( (\rdistinct{(\rflts \; (\map \; \rsimp{}\; rs))}{\varnothing}) @ (
	\map \; \rsimp{rs'}))$


\end{lemma}
\begin{proof}
	By \ref{good1}.
\end{proof}
\noindent





We are also 
\subsubsection{$\rsimp$ is Idempotent}
The idempotency of $\rsimp$ is very useful in 
manipulating regular expression terms into desired
forms so that key steps allowing further rewriting to closed forms
are possible.
\begin{lemma}\label{rsimpIdem}
	$\rsimp{r} = \rsimp{\rsimp{r}}$
\end{lemma}

\begin{proof}
	By \ref{test} and \ref{good1}.
\end{proof}
\noindent
This property means we do not have to repeatedly
apply simplification in each step, which justifies
our definition of $\blexersimp$.


On the other hand, we could repeat the same $\rsimp{}$ applications
on regular expressions as many times as we want, if we have at least
one simplification applied to it, and apply it wherever we would like to:
\begin{corollary}\label{headOneMoreSimp}
	The following properties hold, directly from \ref{rsimpIdem}:

	\begin{itemize}
		\item
			$\map \; \rsimp{(r :: rs)} = \map \; \rsimp{} \; (\rsimp{r} :: rs)$
		\item
			$\rsimp{(\RALTS{rs})} = \rsimp{(\RALTS{\map \; \rsimp{} \; rs})}$
	\end{itemize}
\end{corollary}
\noindent
This will be useful in later closed form proof's rewriting steps.
Similarly, we point out the following useful facts below:
\begin{lemma}
	The following equalities hold if $r = \rsimp{r'}$ for some $r'$:
	\begin{itemize}
		\item
			If $r  = \sum rs$ then $\rsimpalts \; rs = \sum rs$.
		\item
			If $r = \sum rs$ then $\rdistinct{rs}{\varnothing} = rs$.
		\item
			$\rsimpalts \; (\rdistinct{\rflts \; [r]}{\varnothing}) = r$.
	\end{itemize}
\end{lemma}
\begin{proof}
	By application of \ref{rsimpIdem} and \ref{good1}.
\end{proof}

\noindent
With the idempotency of $\rsimp{}$ and its corollaries, 
we can start proving some key equalities leading to the 
closed forms.
Now presented are a few equivalent terms under $\rsimp{}$.
We use $r_1 \sequal r_2 $ here to denote $\rsimp{r_1} = \rsimp{r_2}$.
\begin{lemma}
	\begin{itemize}
		The following equivalence hold:
	\item
		$\rsimpalts \; (\RZERO :: rs) \sequal \rsimpalts\; rs$
	\item
		$\rsimpalts \; rs \sequal \rsimpalts (\map \; \rsimp{} \; rs)$
	\item
		$\RALTS{\RALTS{rs}} \sequal \RALTS{rs}$
	\item
		$\sum ((\sum rs_a) :: rs_b) \sequal \sum rs_a @ rs_b$
	\item
		$\RALTS{rs} = \RALTS{\map \; \rsimp{} \; rs}$
\end{itemize}
\end{lemma}
\begin{proof}
	By induction on the lists involved.
\end{proof}
\noindent
The above allows us to prove
two similar equalities (which are a bit more involved).
It says that we could flatten out the elements
before simplification and still get the same result.
\begin{lemma}\label{simpFlatten3}
	One can flatten the inside $\sum$ of a $\sum$ if it is being 
	simplified. Concretely,
	\begin{itemize}
		\item
			If for all $r \in rs, rs', rs''$, we have $\good \; r $
			or $r = \RZERO$, then $\sum (rs' @ rs @ rs'') \sequal 
			\sum (rs' @ [\sum rs] @ rs'')$ holds. As a corollary,
		\item
			$\sum (rs' @ [\sum rs] @ rs'') \sequal \sum (rs' @ rs @ rs'')$
	\end{itemize}
\end{lemma}
\begin{proof}
	By rewriting steps involving the use of \ref{test} and \ref{rdistinctConcatGeneral}.
	The second sub-lemma is a corollary of the previous.
\end{proof}
%Rewriting steps not put in--too long and complicated-------------------------------
\begin{comment}
	\begin{center}
		$\rsimp{\sum (rs' @ rs @ rs'')}  \stackrel{def of bsimp}{=}$  \\
		$\rsimpalts \; (\rdistinct{\rflts \; ((\map \; \rsimp{}\; rs') @ (\map \; \rsimp{} \; rs ) @ (\map \; \rsimp{} \; rs''))}{\varnothing})$ \\
		$\stackrel{by \ref{test}}{=} 
		\rsimpalts \; (\rdistinct{(\rflts \; rs' @ \rflts \; rs @ \rflts \; rs'')}{
		\varnothing})$\\
		$\stackrel{by \ref{rdistinctConcatGeneral}}{=}
		\rsimpalts \; (\rdistinct{\rflts \; rs'}{\varnothing} @ \rdistinct{(
		\rflts\; rs @ \rflts \; rs'')}{\rflts \; rs'})$\\

	\end{center}
\end{comment}
%Rewriting steps not put in--too long and complicated-------------------------------
\noindent
We need more equalities like the above to enable a closed form,
but to proceed we need to introduce two rewrite relations,
to make things smoother.
\subsubsection{The rewrite relation $\hrewrite$ , $\scfrewrites$ , $\frewrite$ and $\grewrite$}
Insired by the success we had in the correctness proof 
in \ref{Bitcoded2}, where we invented
a term rewriting system to capture the similarity between terms,
we follow suit here defining simplification
steps as rewriting steps. This allows capturing 
similarities between terms that would be otherwise
hard to express.

We use $\hrewrite$ for one-step atomic rewrite of 
regular expression simplification, 
$\frewrite$ for rewrite of list of regular expressions that 
include all operations carried out in $\rflts$, and $\grewrite$ for
rewriting a list of regular expressions possible in both $\rflts$ and $\rdistinct{}{}$.
Their reflexive transitive closures are used to denote zero or many steps,
as was the case in the previous chapter.
The presentation will be more concise than that in \ref{Bitcoded2}.
To differentiate between the rewriting steps for annotated regular expressions
and $\rrexp$s, we add characters $h$ and $g$ below the squig arrow symbol
to mean atomic simplification transitions 
of $\rrexp$s and $\rrexp$ lists, respectively.



List of one-step rewrite rules for $\rrexp$ ($\hrewrite$):


\begin{center}
	\begin{mathpar}
		\inferrule[RSEQ0L]{}{\RZERO \cdot r_2 \hrewrite \RZERO\\}

		\inferrule[RSEQ0R]{}{r_1 \cdot \RZERO \hrewrite \RZERO\\}

		\inferrule[RSEQ1]{}{(\RONE \cdot r) \hrewrite  r\\}\\	

		\inferrule[RSEQL]{ r_1 \hrewrite r_2}{r_1 \cdot r_3 \hrewrite r_2 \cdot r_3\\}

		\inferrule[RSEQR]{ r_3 \hrewrite r_4}{r_1 \cdot r_3 \hrewrite r_1 \cdot r_4\\}\\

		\inferrule[RALTSChild]{r \hrewrite r'}{\sum (rs_1 @ [r] @ rs_2) \hrewrite \sum (rs_1 @ [r'] @ rs_2)\\}

		\inferrule[RALTS0]{}{\sum (rs_a @ [\RZERO] @ rs_b) \hrewrite \sum (rs_a @ rs_b)}

		\inferrule[RALTSNested]{}{\sum (rs_a @ [\sum rs_1] @ rs_b) \hrewrite \sum (rs_a @ rs_1 @ rs_b)}

		\inferrule[RALTSNil]{}{ \sum [] \hrewrite \RZERO\\}

		\inferrule[RALTSSingle]{}{ \sum [r] \hrewrite  r\\}	

		\inferrule[RALTSDelete]{\\ r_1 = r_2}{\sum rs_a @ [r_1] @ rs_b @ [r_2] @ rsc \hrewrite \sum rs_a @ [r_1] @ rs_b @ rs_c}

	\end{mathpar}
\end{center}


List of rewrite rules for a list of regular expressions,
where each element can rewrite in many steps to the other (scf stands for
li\emph{s}t \emph{c}losed \emph{f}orm). This relation is similar to the 
$\stackrel{s*}{\rightsquigarrow}$ for annotated regular expressions.

\begin{center}
	\begin{mathpar}
		\inferrule{}{[] \scfrewrites [] }
		\inferrule{r \hrewrites r' \\ rs \scfrewrites rs'}{r :: rs \scfrewrites r' :: rs'}
	\end{mathpar}
\end{center}
%frewrite
List of one-step rewrite rules for flattening 
a list of  regular expressions($\frewrite$):
\begin{center}
	\begin{mathpar}
		\inferrule{}{\RZERO :: rs \frewrite rs \\}

		\inferrule{}{(\sum rs) :: rs_a \frewrite rs @ rs_a \\}

		\inferrule{rs_1 \frewrite rs_2}{r :: rs_1 \frewrite r :: rs_2}
	\end{mathpar}
\end{center}

Lists of one-step rewrite rules for flattening and de-duplicating
a list of regular expressions ($\grewrite$):
\begin{center}
	\begin{mathpar}
		\inferrule{}{\RZERO :: rs \grewrite rs \\}

		\inferrule{}{(\sum rs) :: rs_a \grewrite rs @ rs_a \\}

		\inferrule{rs_1 \grewrite rs_2}{r :: rs_1 \grewrite r :: rs_2}

		\inferrule[dB]{}{rs_a @ [a] @ rs_b @[a] @ rs_c \grewrite rs_a @ [a] @ rsb @ rsc}
	\end{mathpar}
\end{center}

\noindent
The reason why we take the trouble of defining 
two separate list rewriting definitions $\frewrite$ and $\grewrite$
is to separate the two stages of simplification: flattening and de-duplicating.
Sometimes $\grewrites$ is slightly too powerful
so we would rather use $\frewrites$ which makes certain rewriting steps 
more straightforward to prove.
For example, when proving the closed-form for the alternative regular expression,
one of the rewriting steps would be:
\begin{lemma}
	$\sum (\rDistinct \;\; (\map \; (\_ \backslash x) \; (\rflts \; rs)) \;\; \varnothing) \sequal
	\sum (\rDistinct \;\;  (\rflts \; (\map \; (\_ \backslash x) \; rs)) \;\; \varnothing)
	$
\end{lemma}
\noindent
Proving this is by first showing 
\begin{lemma}\label{earlyLaterDerFrewrites}
	$\map \; (\_ \backslash x) \;  (\rflts \; rs) \frewrites
	\rflts \; (\map \; (\_ \backslash x) \; rs)$
\end{lemma}
\noindent
and then using lemma
\begin{lemma}\label{frewritesSimpeq}
	If $rs_1 \frewrites rs_2 $, then $\sum (\rDistinct \; rs_1 \; \varnothing) \sequal 
	\sum (\rDistinct \;  rs_2 \;  \varnothing)$.
\end{lemma}
\noindent
is a piece of cake.
But this trick will not work for $\grewrites$.
For example, a rewriting step in proving
closed forms is:
\begin{center}
	$\rsimp{(\rsimpalts \; (\map \; (\_ \backslash x) \; (\rdistinct{(\rflts \; (\map \; (\rsimp{} \; \circ \; (\lambda r. \rderssimp{r}{xs}))))}{\varnothing})))}$\\
	$=$ \\
	$\rsimp{(\rsimpalts \; (\rdistinct{(\map \; (\_ \backslash x) \; (\rflts \; (\map \; (\rsimp{} \; \circ \; (\lambda r. \rderssimp{r}{xs})))) ) }{\varnothing}))} $
	\noindent
\end{center}
For this one would hope to have a rewriting relation between the two lists involved,
similar to \ref{earlyLaterDerFrewrites}. However, it turns out that 
\begin{center}
	$\map \; (\_ \backslash x) \; (\rDistinct \; rs \; rset) \grewrites \rDistinct \; (\map \;
	(\_ \backslash x) \; rs) \; ( rset \backslash x)$
\end{center}
\noindent
does $\mathbf{not}$ hold in general.
For this rewriting step we will introduce some slightly more cumbersome
proof technique in later sections.
The point is that $\frewrite$
allows us to prove equivalence in a straightforward two-step method that is 
not possible for $\grewrite$, thereby reducing the complexity of the entire proof.


\subsubsection{Terms That Can Be Rewritten Using $\hrewrites$, $\grewrites$, and $\frewrites$}
We present in the below lemma a few pairs of terms that are rewritable via 
$\grewrites$:
\begin{lemma}\label{gstarRdistinctGeneral}
	\begin{itemize}
		\item
			$rs_1 @ rs \grewrites rs_1 @ (\rDistinct \; rs \; rs_1)$
		\item
			$rs \grewrites \rDistinct \; rs \; \varnothing$
		\item
			$rs_a @ (\rDistinct \; rs \; rs_a) \grewrites rs_a @ (\rDistinct \; 
			rs \; (\{\RZERO\} \cup rs_a))$
		\item
			$rs \;\; @ \;\; \rDistinct \; rs_a \; rset \grewrites rs @  \rDistinct \; rs_a \;
			(rest \cup rs)$

	\end{itemize}
\end{lemma}
\noindent
If a pair of terms $rs_1, rs_2$ are rewritable via $\grewrites$ to each other,
then they are equivalent under $\rsimp{}$:
\begin{lemma}\label{grewritesSimpalts}
	If $rs_1 \grewrites rs_2$, then
	we have the following equivalence hold:
	\begin{itemize}
		\item
			$\sum rs_1 \sequal \sum rs_2$
		\item
			$\rsimpalts \; rs_1 \sequal \rsimpalts \; rs_2$
	\end{itemize}
\end{lemma}
\noindent
Here are a few connecting lemmas showing that
if a list of regular expressions can be rewritten using $\grewrites$ or $\frewrites $ or
$\scfrewrites$,
then an alternative constructor taking the list can also be rewritten using $\hrewrites$:
\begin{lemma}
	\begin{itemize}
		\item
			If $rs \grewrites rs'$ then $\sum rs \hrewrites \sum rs'$.
		\item
			If $rs \grewrites rs'$ then $\sum rs \hrewrites \rsimpalts \; rs'$
		\item
			If $rs_1 \scfrewrites rs_2$ then $\sum (rs @ rs_1) \hrewrites \sum (rs @ rs_2)$
		\item
			If $rs_1 \scfrewrites rs_2$ then $\sum rs_1 \hrewrites \sum rs_2$

	\end{itemize}
\end{lemma}
\noindent
Here comes the meat of the proof, 
which says that once two lists are rewritable to each other,
then they are equivalent under $\rsimp{}$:
\begin{lemma}
	If $r_1 \hrewrites r_2$ then $r_1 \sequal r_2$.
\end{lemma}

\noindent
And similar to \ref{Bitcoded2} one can preserve rewritability after taking derivative
of two regular expressions on both sides:
\begin{lemma}\label{interleave}
	If $r \hrewrites r' $ then $\rder{c}{r} \hrewrites \rder{c}{r'}$
\end{lemma}
\noindent
This allows proving more $\mathbf{rsimp}$-equivalent terms, involving $\backslash_r$ now.
\begin{lemma}\label{insideSimpRemoval}
	$\rsimp{\rder{c}{\rsimp{r}}} = \rsimp{\rder{c}{r}}  $
\end{lemma}
\noindent
\begin{proof}
	By \ref{interleave} and \ref{rsimpIdem}.
\end{proof}
\noindent
And this unlocks more equivalent terms:
\begin{lemma}\label{Simpders}
	As corollaries of \ref{insideSimpRemoval}, we have
	\begin{itemize}
		\item
			If $s \neq []$ then $\rderssimp{r}{s} = \rsimp{(\rders \; r \; s)}$.
		\item
			$\rsimpalts \; (\map \; (\_ \backslash_r x) \;
			(\rdistinct{rs}{\varnothing})) \sequal
			\rsimpalts \; (\rDistinct \; 
			(\map \; (\_ \backslash_r x) rs) \;\varnothing  )$
	\end{itemize}
\end{lemma}
\noindent

Finally,
together with 
\begin{lemma}\label{rderRsimpAltsCommute}
	$\rder{x}{(\rsimpalts \; rs)} = \rsimpalts \; (\map \; (\rder{x}{\_}) \; rs)$
\end{lemma}
\noindent
this leads to the first closed form--
\begin{lemma}\label{altsClosedForm}
	\begin{center}
		$\rderssimp{(\sum rs)}{s} \sequal
		\sum \; (\map \; (\rderssimp{\_}{s}) \; rs)$
	\end{center}
\end{lemma}

\noindent
\begin{proof}
	By a reverse induction on the string $s$.
	One rewriting step, as we mentioned earlier,
	involves
	\begin{center}
		$\rsimpalts \; (\map \; (\_ \backslash x) \; 
		(\rdistinct{(\rflts \; (\map \; (\rsimp{} \; \circ \; 
		(\lambda r. \rderssimp{r}{xs}))))}{\varnothing}))
		\sequal
		\rsimpalts \; (\rdistinct{(\map \; (\_ \backslash x) \; 
			(\rflts \; (\map \; (\rsimp{} \; \circ \; 
		(\lambda r. \rderssimp{r}{xs})))) ) }{\varnothing}) $.
	\end{center}
	This can be proven by a combination of 
	\ref{grewritesSimpalts}, \ref{gstarRdistinctGeneral}, \ref{rderRsimpAltsCommute}, and
	\ref{insideSimpRemoval}.
\end{proof}
\noindent
This closed form has a variant which can be more convenient in later proofs:
\begin{corollary}{altsClosedForm1}
	If $s \neq []$ then 
	$\rderssimp \; (\sum \; rs) \; s = 
	\rsimp{(\sum \; (\map \; \rderssimp{\_}{s} \; rs))}$.
\end{corollary}
\noindent
The harder closed forms are the sequence and star ones.
Before we go on to obtain them, some preliminary definitions
are needed to make proof statements concise.




\subsection{Closed Forms}
\subsubsection{Closed Form for Sequence Regular Expressions}
The problem of obataining a closed-form for sequence regular expression 
is constructing $(r_1 \cdot r_2) \backslash_r s$
if we are only allowed to use a combination of $r_1 \backslash s''$ 
and  $r_2 \backslash s''$ , where $s''$ is from $s$.
First let's look at a series of derivatives steps on a sequence 
regular expression, assuming that each time the first
component of the sequence is always nullable):
\begin{center}

	$r_1 \cdot r_2 \quad \longrightarrow_{\backslash c}  \quad   r_1  \backslash c \cdot r_2 + r_2 \backslash c \quad \longrightarrow_{\backslash c'} \quad (r_1 \backslash cc' \cdot r_2 + r_2 \backslash c') + r_2 \backslash cc' \longrightarrow_{\backslash c''} \quad$\\
	$((r_1 \backslash cc'c'' \cdot r_2 + r_2 \backslash c'') + r_2 \backslash c'c'') + r_2 \backslash cc'c''   \longrightarrow_{\backslash c''} \quad
	\ldots$

\end{center}
Roughly speaking $r_1 \cdot r_2 \backslash s$ can be expresssed as 
a giant alternative taking a list of terms 
$[r_1 \backslash_r s \cdot r_2, r_2 \backslash_r s'', r_2 \backslash_r s_1'', \ldots]$,
where the head of the list is always the term
representing a match involving only $r_1$, and the tail of the list consisting of
terms of the shape $r_2 \backslash_r s''$, $s''$ being a suffix of $s$.
This intuition is also echoed by IndianPaper, where they gave
a pencil-and-paper derivation of $(r_1 \cdot r_2)\backslash s$:
\begin{center}
	\begin{tabular}{c}
		$(r_1 \cdot r_2) \backslash_r (c_1 :: c_2 :: \ldots c_n) \myequiv$\\ 
		\rule{0pt}{3ex} $((r_1 \backslash_r c_1) \cdot r_2 + (\delta\; (\rnullable \; r_1) \; r_2 \backslash_r c_1)) \backslash_r (c_2 :: \ldots c_n) 
		\myequiv$\\
		\rule{0pt}{3ex} $((r_1 \backslash_r c_1c_2 \cdot r_2 + (\delta \; (\rnullable \; r_1) \; r_2 \backslash_r c_1c_2))
		+ (\delta \ (\rnullable \; r_1 \backslash_r c)\; r_2 \backslash_r c_2)) \backslash_r (c_3 \ldots c_n)
		$
	\end{tabular}
\end{center}
\noindent
The equality in above should be interpretated
as language equivalence. 
The $\delta$ function works similarly to that of
a Kronecker delta function:
\[ \delta \; b\; r\]
will produce $r$
if $b$ evaluates to true, 
and $\RZERO$ otherwise.
Note that their formulation  
\[
	((r_1 \backslash_r \, c_1c_2 \cdot r_2 + (\delta \; (\rnullable) \; r_1, r_2 \backslash_r c_1c_2)
	+ (\delta \; (\rnullable \; r_1 \backslash_r c)\; r_2 \backslash_r c_2)
\]
does not faithfully
represent what the intermediate derivatives would actually look like
when one or more intermediate results $r_1 \backslash s' \cdot r_2$ are not 
nullable in the head of the sequence.
For example, when $r_1$ and $r_1 \backslash_r c_1$ are not nullable,
the regular expression would not look like 
\[
	(r_1 \backslash_r c_1c_2 + \RZERO ) + \RZERO,
\]
but actually $r_1 \backslash_r c_1c_2$, the redundant $\RZERO$s will not be created in the
first place.
In a closed-form one would want to take into account this 
and generate the list of
regular expressions $r_2 \backslash_r s''$ with
string pairs $(s', s'')$ where $s'@s'' = s$ and
$r_1 \backslash s'$ nullable.
We denote the list consisting of such 
strings $s''$ as $\vsuf{s}{r_1}$.

The function $\vsuf{\_}{\_}$ is defined recursively on the structure of the string:
\begin{center}
	\begin{tabular}{lcl}
		$\vsuf{[]}{\_} $ & $=$ &  $[]$\\
		$\vsuf{c::cs}{r_1}$ & $ =$ & $ \textit{if} (\rnullable{r_1}) \textit{then} \; (\vsuf{cs}{(\rder{c}{r_1})}) @ [c :: cs]$\\
				    && $\textit{else} \; (\vsuf{cs}{(\rder{c}{r_1}) })  $
	\end{tabular}
\end{center}
\noindent
The list is sorted in the order $r_2\backslash s''$ 
appears in $(r_1\cdot r_2)\backslash s$.
In essence, $\vsuf{\_}{\_}$ is doing a 
"virtual derivative" of $r_1 \cdot r_2$, but instead of producing 
the entire result $(r_1 \cdot r_2) \backslash s$, 
it only stores all the strings $s''$ such that $r_2 \backslash s''$
are occurring terms in $(r_1\cdot r_2)\backslash s$.

To make the closed form representation 
more straightforward,
the flattetning function $\sflat{\_}$ is used to enable the transformation from 
a left-associative nested sequence of alternatives into 
a flattened list:
\[
	\sum [r_1, r_2, r_3, \ldots] \stackrel{\sflat{\_}}{\rightarrow} 
	(\ldots ((r_1 + r_2) + r_3) + \ldots)
\]
\noindent
The definitions $\sflat{\_}$, $\sflataux{\_}$ are given below.
\begin{center}  
	\begin{tabular}{ccc}
		$\sflataux{\AALTS{ }{r :: rs}}$ & $=$ & $\sflataux{r} @ rs$\\
		$\sflataux{\AALTS{ }{[]}}$ & $ = $ & $ []$\\
		$\sflataux r$ & $=$ & $ [r]$
	\end{tabular}
\end{center}

\begin{center} 
	\begin{tabular}{ccc}
		$\sflat{(\sum r :: rs)}$ & $=$ & $\sum (\sflataux{r} @ rs)$\\
		$\sflat{\sum []}$ & $ = $ & $ \sum []$\\
		$\sflat r$ & $=$ & $ r$
	\end{tabular}
\end{center}
\noindent
$\sflataux{\_}$ breaks up nested alternative regular expressions 
of the $(\ldots((r_1 + r_2) + r_3) + \ldots )$(left-associated) shape
into a "balanced" list: $\AALTS{\_}{[r_1,\, r_2 ,\, r_3, \ldots]}$.
It will return the singleton list $[r]$ otherwise.
$\sflat{\_}$ works the same  as $\sflataux{\_}$, except that it keeps
the output type a regular expression, not a list.
$\sflataux{\_}$  and $\sflat{\_}$ are only recursive on the  
first element of the list.

With $\sflataux{}$ a preliminary to the closed form can be stated,
where the derivative of $r_1 \cdot r_2 \backslash s$ can be
flattened into a list whose head and tail meet the description
we gave earlier.
\begin{lemma}\label{seqSfau0}
	$\sflataux{\rders{(r_1 \cdot r_2) \backslash s }} = (r_1 \backslash_r s) \cdot r_2 
	:: (\map \; (r_2 \backslash_r \_) \; (\textit{Suffix} \; s \; r1))$ 
\end{lemma}
\begin{proof}
	By an induction on the string $s$, where the inductive cases 
	are split as $[]$ and $xs @ [x]$.
	Note the key identify holds:
	\[
		\map \; (r_2 \backslash_r \_) \; (\vsuf{[x]}{(r_1 \backslash_r xs)}) \;\; @ \;\;
		\map \; (\_ \backslash_r x) \; (\map \; (r_2 \backslash \_) \; (\vsuf{xs}{r_1}))
	\]
	=
	\[
		\map \; (r_2 \backslash_r \_) \; (\vsuf{xs @ [x]}{r_1})
	\]
	This enables the inductive case to go through.
\end{proof}
\noindent 
Note that this lemma does $\mathbf{not}$ depend on any
specific definitions we used,
allowing people investigating derivatives to get an alternative
view of what $r_1 \cdot r_2$ is.

Now we are able to use this for the intuition that 
the different ways in which regular expressions are 
nested do not matter under $\rsimp{}$:
\begin{center}
	$\rsimp{r} \stackrel{?}{\sequal} \rsimp{r'}$ if $r = \sum [r_1, r_2, r_3, \ldots]$ 
	and $r' =(\ldots ((r_1 + r_2) + r_3) + \ldots)$
\end{center}
Simply wrap with $\sum$ constructor and add 
simplifications to both sides of \ref{seqSfau0}
and one gets
\begin{corollary}\label{seqClosedFormGeneral}
	$\rsimp{\sflat{(r_1 \cdot r_2) \backslash s} }
	=\rsimp{(\sum (  (r_1 \backslash s) \cdot r_2 :: 
	\map\; (r_2 \backslash \_) \; (\vsuf{s}{r_1})))}$
\end{corollary}
Together with the idempotency property of $\rsimp{}$ (lemma \ref{rsimpIdem}),
it is possible to convert the above lemma to obtain a "closed form"
for  derivatives nested with simplification:
\begin{lemma}\label{seqClosedForm}
	$\rderssimp{(r_1 \cdot r_2)}{s} = \rsimp{(\sum ((r_1 \backslash s) \cdot r_2 ) 
	:: (\map \; (r_2 \backslash \_) (\vsuf{s}{r_1})))}$
\end{lemma}
\begin{proof}
	By a case analysis of string $s$.
	When $s$ is empty list, the rewrite is straightforward.
	When $s$ is a list, one could use the corollary \ref{seqSfau0},
	and lemma \ref{Simpders} to rewrite the left-hand-side.
\end{proof}
As a corollary for this closed form, one can estimate the size 
of the sequence derivative $r_1 \cdot r_2 \backslash_r s$ using 
an easier-to-handle expression:
\begin{corollary}\label{seqEstimate1}
	\begin{center}

		$\llbracket \rderssimp{(r_1 \cdot r_2)}{s} \rrbracket_r = \llbracket \rsimp{(\sum ((r_1 \backslash s) \cdot r_2 ) 
		:: (\map \; (r_2 \backslash \_) (\vsuf{s}{r_1})))} \rrbracket_r$

	\end{center}
\end{corollary}
\noindent
\subsubsection{Closed Forms for Star Regular Expressions}
We have shown how to control the size of the sequence regular expression $r_1\cdot r_2$ using
the "closed form" of $(r_1 \cdot r_2) \backslash s$ and then 
the property of the $\distinct$ function.
Now we try to get a bound on $r^* \backslash s$ as well.
Again, we first look at how a star's derivatives evolve, if they grow maximally: 
\begin{center}

	$r^* \quad \longrightarrow_{\backslash c}  \quad   (r\backslash c)  \cdot  r^* \quad \longrightarrow_{\backslash c'}  \quad
	r \backslash cc'  \cdot r^* + r \backslash c' \cdot r^*  \quad \longrightarrow_{\backslash c''} \quad 
	(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*)   \quad \longrightarrow_{\backslash c'''}
	\quad \ldots$

\end{center}
When we have a string $s = c :: c' :: c'' \ldots$  such that $r \backslash c$, $r \backslash cc'$, $r \backslash c'$, 
$r \backslash cc'c''$, $r \backslash c'c''$, $r\backslash c''$ etc. are all nullable,
the number of terms in $r^* \backslash s$ will grow exponentially, causing the size
of the derivatives $r^* \backslash s$ to grow exponentially, even if we do not 
count the possible size explosions of $r \backslash c$ themselves.

Thanks to $\rflts$ and $\rDistinct$, we are able to open up regular expressions like
$(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + 
(r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) $ 
into $\RALTS{[r_1 \backslash cc'c'' \cdot r^*, r \backslash c'', 
r \backslash c'c'' \cdot r^*, r \backslash c'' \cdot r^*]}$
and then de-duplicate terms of the form $r\backslash s' \cdot r^*$ ($s'$ being a substring of $s$).
This allows us to use a similar technique as $r_1 \cdot r_2$ case,
where the crux is to get an equivalent form of 
$\rderssimp{r^*}{s}$ with shape $\rsimp{\sum rs}$.
This requires generating 
all possible sub-strings $s'$ of $s$
such that $r\backslash s' \cdot r^*$ will appear 
as a term in $(r^*) \backslash s$.
The first function we define is a single-step
updating function $\starupdate$, which takes three arguments as input:
the new character $c$ to take derivative with, 
the regular expression
$r$ directly under the star $r^*$, and the
list of strings $sSet$ for the derivative $r^* \backslash s$ 
up til this point  
such that $(r^*) \backslash s = \sum_{s' \in sSet} (r\backslash s') \cdot r^*$ 
(the equality is not exact, more on this later).
\begin{center}
	\begin{tabular}{lcl}
		$\starupdate \; c \; r \; [] $ & $\dn$ & $[]$\\
		$\starupdate \; c \; r \; (s :: Ss)$ & $\dn$ & \\
						     & & $\textit{if} \; 
						     (\rnullable \; (\rders \; r \; s))$ \\
						     & & $\textit{then} \;\; (s @ [c]) :: [c] :: (
						     \starupdate \; c \; r \; Ss)$ \\
						     & & $\textit{else} \;\; (s @ [c]) :: (
						     \starupdate \; c \; r \; Ss)$
	\end{tabular}
\end{center}
\noindent
As a generalisation from characters to strings,
$\starupdates$ takes a string instead of a character
as the first input argument, and is otherwise the same
as $\starupdate$.
\begin{center}
	\begin{tabular}{lcl}
		$\starupdates \; [] \; r \; Ss$ & $=$ & $Ss$\\
		$\starupdates \; (c :: cs) \; r \; Ss$ &  $=$ &  $\starupdates \; cs \; r \; (
		\starupdate \; c \; r \; Ss)$
	\end{tabular}
\end{center}
\noindent
For the star regular expression,
its derivatives can be seen as  a nested gigantic
alternative similar to that of sequence regular expression's derivatives, 
and therefore need
to be ``straightened out" as well.
The function for this would be $\hflat{}$ and $\hflataux{}$.
\begin{center}
	\begin{tabular}{lcl}
		$\hflataux{r_1 + r_2}$ & $\dn$ & $\hflataux{r_1} @ \hflataux{r_2}$\\
		$\hflataux{r}$ & $\dn$ & $[r]$
	\end{tabular}
\end{center}

\begin{center}
	\begin{tabular}{lcl}
		$\hflat{r_1 + r_2}$ & $\dn$ & $\sum (\hflataux {r_1} @ \hflataux {r_2}) $\\
		$\hflat{r}$ & $\dn$ & $r$
	\end{tabular}
\end{center}
\noindent
%MAYBE TODO: introduce createdByStar
Again these definitions are tailor-made for dealing with alternatives that have
originated from a star's derivatives, so we do not attempt to open up all possible 
regular expressions of the form $\RALTS{rs}$, where $\textit{rs}$ might not contain precisely 2
elements.
We give a predicate for such "star-created" regular expressions:
\begin{center}
	\begin{tabular}{lcr}
	 &    &       $\createdByStar{(\RSEQ{ra}{\RSTAR{rb}}) }$\\
		$\createdByStar{r_1} \land \createdByStar{r_2} $ & $ \Longrightarrow$ & $\createdByStar{(r_1 + r_2)}$
	\end{tabular}
\end{center}

These definitions allows us the flexibility to talk about 
regular expressions in their most convenient format,
for example, flattened out $\RALTS{[r_1, r_2, \ldots, r_n]} $
instead of binary-nested: $((r_1 + r_2) + (r_3 + r_4)) + \ldots$.
These definitions help express that certain classes of syntatically 
distinct regular expressions are actually the same under simplification.
This is not entirely true for annotated regular expressions: 
%TODO: bsimp bders \neq bderssimp
\begin{center}
	$(1+ (c\cdot \ASEQ{bs}{c^*}{c} ))$
\end{center}
For bit-codes, the order in which simplification is applied
might cause a difference in the location they are placed.
If we want something like
\begin{center}
	$\bderssimp{r}{s} \myequiv \bsimp{\bders{r}{s}}$
\end{center}
Some "canonicalization" procedure is required,
which either pushes all the common bitcodes to nodes
as senior as possible:
\begin{center}
	$_{bs}(_{bs_1 @ bs'}r_1 + _{bs_1 @ bs''}r_2) \rightarrow _{bs @ bs_1}(_{bs'}r_1 + _{bs''}r_2) $
\end{center}
or does the reverse. However bitcodes are not of interest if we are talking about
the $\llbracket r \rrbracket$ size of a regex.
Therefore for the ease and simplicity of producing a
proof for a size bound, we are happy to restrict ourselves to 
unannotated regular expressions, and obtain such equalities as
\begin{lemma}
	$\rsimp{r_1 + r_2} = \rsimp{\RALTS{\hflataux{r_1} @ \hflataux{r_2}}}$
\end{lemma}

\begin{proof}
	By using the rewriting relation $\rightsquigarrow$
\end{proof}
%TODO: rsimp sflat
And from this we obtain a proof that a star's derivative will be the same
as if it had all its nested alternatives created during deriving being flattened out:
For example,
\begin{lemma}
	$\createdByStar{r} \implies \rsimp{r} = \rsimp{\RALTS{\hflataux{r}}}$
\end{lemma}
\begin{proof}
	By structural induction on $r$, where the induction rules are these of $\createdByStar{_}$.
\end{proof}
% The simplification of a flattened out regular expression, provided it comes
%from the derivative of a star, is the same as the one nested.



We first introduce an inductive property
for $\starupdate$ and $\hflataux{\_}$, 
it says if we do derivatives of $r^*$
with a string that starts with $c$,
then flatten it out,
we obtain a list
of the shape $\sum_{s' \in sSet} (r\backslash_r s') \cdot r^*$,
where $sSet = \starupdates \; s \; r \; [[c]]$.
\begin{lemma}\label{starHfauInduct}
	$\hflataux{(\rders{( (\rder{c}{r_0})\cdot(r_0^*))}{s})} = 
	\map \; (\lambda s_1. (r_0 \backslash_r s_1) \cdot (r_0^*)) \; 
	(\starupdates \; s \; r_0 \; [[c]])$
\end{lemma}
\begin{proof}
	By an induction on $s$, the inductive cases
	being $[]$ and $s@[c]$.
\end{proof}
\noindent
Here is a corollary that states the lemma in
a more intuitive way:
\begin{corollary}
	$\hflataux{r^* \backslash_r (c::xs)} = \map \; (\lambda s. (r \backslash_r s) \cdot
	(r^*))\; (\starupdates \; c\; r\; [[c]])$
\end{corollary}
\noindent
Note that this is also agnostic of the simplification
function we defined, and is therefore of more general interest.

Now adding the $\rsimp{}$ bit for closed forms,
we have
\begin{lemma}
	$a :: rs \grewrites \hflataux{a} @ rs$
\end{lemma}
\noindent
giving us
\begin{lemma}\label{cbsHfauRsimpeq1}
	$\rsimp{a+b} = \rsimp{(\sum \hflataux{a} @ \hflataux{b})}$.
\end{lemma}
\noindent
This yields
\begin{lemma}\label{hfauRsimpeq2}
	$\rsimp{r} = \rsimp{(\sum \hflataux{r})}$
\end{lemma}
\noindent
Together with the rewriting relation
\begin{lemma}\label{starClosedForm6Hrewrites}
	$\map \; (\lambda s. (\rsimp{r \backslash_r s}) \cdot (r^*)) \; Ss
	\scfrewrites
	\map \; (\lambda s. (\rsimp{r \backslash_r s}) \cdot (r^*)) \; Ss$
\end{lemma}
\noindent
We obtain the closed form for star regular expression:
\begin{lemma}\label{starClosedForm}
	$\rderssimp{r^*}{c::s} = 
	\rsimp{
		(\sum (\map \; (\lambda s. (\rderssimp{r}{s})\cdot r^*) \; 
		(\starupdates \; s\; r \; [[c]])
		)
		)
	}
	$
\end{lemma}
\begin{proof}
	By an induction on $s$.
	The lemmas \ref{rsimpIdem}, \ref{starHfauInduct}, and \ref{hfauRsimpeq2}
	are used.	
\end{proof}






\subsection{Estimating the Closed Forms' sizes}
We now summarize the closed forms below:
\begin{itemize}
	\item
		$\rderssimp{(\sum rs)}{s} \sequal
		\sum \; (\map \; (\rderssimp{\_}{s}) \; rs)$
	\item
		$\rderssimp{(r_1 \cdot r_2)}{s} \sequal \sum ((r_1 \backslash s) \cdot r_2 ) 
		:: (\map \; (r_2 \backslash \_) (\vsuf{s}{r_1}))$
	\item

		$\rderssimp{r^*}{c::s} = 
		\rsimp{
			(\sum (\map \; (\lambda s. (\rderssimp{r}{s})\cdot r^*) \; 
			(\starupdates \; s\; r \; [[c]])
			)
			)
		}
		$
\end{itemize}	
\noindent	
The closed forms on the left-hand-side
are all of the same shape: $\rsimp{ (\sum rs)} $.
Such regular expression will be bounded by the size of $\sum rs'$, 
where every element in $rs'$ is distinct, and each element 
can be described by some inductive sub-structures 
(for example when $r = r_1 \cdot r_2$ then $rs'$ 
will be solely comprised of $r_1 \backslash s'$ 
and $r_2 \backslash s''$, $s'$ and $s''$ being 
sub-strings of $s$).
which will each have a size uppder bound 
according to inductive hypothesis, which controls $r \backslash s$.

We elaborate the above reasoning by a series of lemmas
below, where straightforward proofs are omitted.
\begin{lemma}
	If $\forall r \in rs. \rsize{r} $ is less than or equal to $N$,
	and $\textit{length} \; rs$ is less than or equal to $l$,
	then $\rsize{\sum rs}$ is less than or equal to $l*N + 1$.
\end{lemma}
\noindent
If we define all regular expressions with size no
more than $N$ as $\sizeNregex \; N$:
\[
	\sizeNregex \; N \dn  \{r \mid \rsize{r} \leq N \}
\]
Then such set is finite:
\begin{lemma}\label{finiteSizeN}
	$\textit{isFinite}\; (\sizeNregex \; N)$
\end{lemma}
\begin{proof}
	By overestimating the set $\sizeNregex \; N + 1$
	using union of sets like
	$\{r_1 \cdot r_2 \mid r_1 \in A
		\text{and}
	r_2 \in A\}
	$ where $A = \sizeNregex \; N$.
\end{proof}
\noindent
From this we get a corollary that
if forall $r \in rs$, $\rsize{r} \leq N$, then the output of 
$\rdistinct{rs}{\varnothing}$ is a list of regular
expressions of finite size depending on $N$ only. 
\begin{corollary}\label{finiteSizeNCorollary}
	Assumes that for all $r \in rs. \rsize{r} \leq N$,
	and the cardinality of $\sizeNregex \; N$ is $c_N$
	then$\rsize{\rdistinct{rs}{\varnothing}} \leq c*N$.
\end{corollary}
\noindent
We have proven that the output of $\rdistinct{rs'}{\varnothing}$
is bounded by a constant $c_N$ depending only on $N$,
provided that each of $rs'$'s element
is bounded by $N$.
We want to apply it to our setting $\rsize{\rsimp{\sum rs}}$.

We show that $\rdistinct$ and $\rflts$
working together is at least as 
good as $\rdistinct{}{}$ alone, which can be written as
\begin{center}
	$\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r 
	\leq 
	\llbracket \rdistinct{rs}{\varnothing}  \rrbracket_r  $.
\end{center}
We need this so that we know the outcome of our real 
simplification is better than or equal to a rough estimate,
and therefore can be bounded by that estimate.
This is a bit harder to establish compared with proving
$\textit{flts}$ does not make a list larger (which can
be proven using routine induction):
\begin{center}
	$\llbracket  \textit{rflts}\; rs \rrbracket_r \leq
	\llbracket  \textit{rs} \rrbracket_r$
\end{center}
We cannot simply prove how each helper function
reduces the size and then put them together:
From
\begin{center}
$\llbracket  \textit{rflts}\; rs \rrbracket_r \leq
	\llbracket \; \textit{rs} \rrbracket_r$
\end{center}
and
\begin{center}
     $\llbracket  \textit{rdistinct} \; rs \; \varnothing \leq
     \llbracket rs \rrbracket_r$
\end{center}
one cannot imply
\begin{center}
	$\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r 
	\leq 
	\llbracket \rdistinct{rs}{\varnothing}  \rrbracket_r  $.
\end{center}
What we can imply is that 
\begin{center}
	$\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r 
	\leq
	\llbracket rs \rrbracket_r$
\end{center}
but this estimate is too rough and $\llbracket rs \rrbracket_r$	is unbounded.
The way we 
get through this is by first proving a more general lemma 
(so that the inductive case goes through):
\begin{lemma}\label{fltsSizeReductionAlts}
	If we have three accumulator sets:
	$noalts\_set$, $alts\_set$ and $corr\_set$,
	satisfying:
	\begin{itemize}
		\item
			$\forall r \in noalts\_set. \; \nexists xs.\; r = \sum  xs$
		\item
			$\forall r \in alts\_set. \; \exists xs. \; r = \sum xs
			\; \textit{and} \; set \; xs \subseteq corr\_set$
	\end{itemize}
	then we have that
	\begin{center}
	\begin{tabular}{lcl}
	$\llbracket  (\textit{rdistinct} \; (\textit{rflts} \; as) \;
	(noalts\_set \cup corr\_set)) \rrbracket_r$ & $\leq$ &\\
						    $\llbracket  (\textit{rdistinct} \; as \; (noalts\_set \cup alts\_set \cup
	\{ \ZERO \} )) \rrbracket_r$ & & \\ 
	\end{tabular}
	\end{center}
		holds.
\end{lemma}
\noindent
We need to split the accumulator into two parts: the part
which contains alternative regular expressions ($alts\_set$), and 
the part without any of them($noalts\_set$).
The set $corr\_set$ is the corresponding set
of $alts\_set$ with all elements under the $\sum$ constructor
spilled out.
\begin{proof}
	By induction on the list $as$. We make use of lemma \ref{rdistinctConcat}.
\end{proof}
By setting all three sets to the empty set, one gets the desired size estimate:
\begin{corollary}\label{interactionFltsDB}
	$\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r 
	\leq 
	\llbracket \rdistinct{rs}{\varnothing}  \rrbracket_r  $.
\end{corollary}
\begin{proof}
	By using the lemma \ref{fltsSizeReductionAlts}.
\end{proof}
\noindent
The intuition is that if we remove duplicates from the $\textit{LHS}$, at least the same amount of 
duplicates will be removed from the list $\textit{rs}$ in the $\textit{RHS}$. 

Now this $\rsimp{\sum rs}$ can be estimated using $\rdistinct{rs}{\varnothing}$:
\begin{lemma}\label{altsSimpControl}
	$\rsize{\rsimp{\sum rs}} \leq \rsize{\rdistinct{rs}{\varnothing}}+ 1$
\end{lemma}
\begin{proof}
	By using \ref{interactionFltsDB}.
\end{proof}
\noindent
This is a key lemma in establishing the bounds on all the 
closed forms.
With this we are now ready to control the sizes of
$(r_1 \cdot r_2 )\backslash s$, $r^* \backslash s$.
\begin{theorem}
	For any regex $r$, $\exists N_r. \forall s. \; \rsize{\rderssimp{r}{s}} \leq N_r$
\end{theorem}
\noindent
\begin{proof}
	We prove this by induction on $r$. The base cases for $\RZERO$,
	$\RONE $ and $\RCHAR{c}$ are straightforward. 
	In the sequence $r_1 \cdot r_2$ case,
	the inductive hypotheses state 
	$\exists N_1. \forall s. \; \llbracket \rderssimp{r}{s} \rrbracket \leq N_1$ and
	$\exists N_2. \forall s. \; \llbracket \rderssimp{r_2}{s} \rrbracket \leq N_2$. 

	When the string $s$ is not empty, we can reason as follows
	%
	\begin{center}
		\begin{tabular}{lcll}
& & $ \llbracket   \rderssimp{r_1\cdot r_2 }{s} \rrbracket_r $\\
& $ = $ & $\llbracket \rsimp{(\sum(r_1 \backslash_{rsimp} s \cdot r_2 \; \;  :: \; \; 
		\map \; (r_2\backslash_{rsimp} \_)\; (\vsuf{s}{r})))} \rrbracket_r $ & (1) \\			
										     & $\leq$ & $\llbracket \rdistinct{(r_1 \backslash_{rsimp} s \cdot r_2 \; \;  :: \; \; 
	\map \; (r_2\backslash_{rsimp} \_)\; (\vsuf{s}{r}))}{\varnothing} \rrbracket_r  + 1$ & (2) \\
											     & $\leq$ & $2 + N_1 + \rsize{r_2} + (N_2 * (card\;(\sizeNregex \; N_2)))$ & (3)\\
\end{tabular}
\end{center}
\noindent
(1) is by the corollary \ref{seqEstimate1}.
(2) is by \ref{altsSimpControl}.
(3) is by \ref{finiteSizeNCorollary}.


Combining the cases when $s = []$ and $s \neq []$, we get (4):
\begin{center}
	\begin{tabular}{lcll}
		$\rsize{(r_1 \cdot r_2) \backslash_r s}$ & $\leq$ & 
		$max \; (2 + N_1 + 
		\llbracket r_2 \rrbracket_r + 
		N_2 * (card\; (\sizeNregex \; N_2))) \; \rsize{r_1\cdot r_2}$ & (4)
	\end{tabular}
\end{center}

We reason similarly for  $\STAR$.
The inductive hypothesis is
$\exists N. \forall s. \; \llbracket \rderssimp{r}{s} \rrbracket \leq N$.
Let $n_r = \llbracket r^* \rrbracket_r$.
When $s = c :: cs$ is not empty,
\begin{center}
	\begin{tabular}{lcll}
& & $ \llbracket   \rderssimp{r^* }{c::cs} \rrbracket_r $\\
& $ = $ & $\llbracket \rsimp{(\sum (\map \; (\lambda s. (r \backslash_{rsimp} s) \cdot r^*) \; (\starupdates\; 
	cs \; r \; [[c]] )) )} \rrbracket_r $ & (5) \\			
					      & $\leq$ & $\llbracket 
					      \rdistinct{
						      (\map \; 
						      (\lambda s. (r \backslash_{rsimp} s) \cdot r^*) \; 
						      (\starupdates\; cs \; r \; [[c]] )
					      )}
	{\varnothing} \rrbracket_r  + 1$ & (6) \\
					 & $\leq$ & $1 + (\textit{card} (\sizeNregex \; (N + n_r)))
	* (1 + (N + n_r)) $ & (7)\\
\end{tabular}
\end{center}
\noindent
(5) is by the lemma  \ref{starClosedForm}.
(6) is by \ref{altsSimpControl}.
(7) is by \ref{finiteSizeNCorollary}.
Combining with the case when $s = []$, one gets
\begin{center}
	\begin{tabular}{lcll}
		$\rsize{r^* \backslash_r s}$ & $\leq$ & $max \; n_r \; 1 + (\textit{card} (\sizeNregex \; (N + n_r)))
		* (1 + (N + n_r)) $ & (8)\\
	\end{tabular}
\end{center}
\noindent

The alternative case is slightly less involved.
The inductive hypothesis 
is equivalent to $\exists N. \forall r \in (\map \; (\_ \backslash_r s) \; rs). \rsize{r} \leq N$.
In the case when $s = c::cs$, we have 
\begin{center}
	\begin{tabular}{lcll}
& & $ \llbracket   \rderssimp{\sum rs }{c::cs} \rrbracket_r $\\
& $ = $ & $\llbracket \rsimp{(\sum (\map \; (\_ \backslash_{rsimp} s)  \; rs) )} \rrbracket_r $ & (9) \\			
& $\leq$ & $\llbracket (\sum (\map \; (\_ \backslash_{rsimp} s)  \; rs) ) \rrbracket_r $  & (10) \\
& $\leq$ & $1 + N * (length \; rs) $ & (11)\\
	\end{tabular}
\end{center}
\noindent
(9) is by \ref{altsClosedForm}, (10) by \ref{rsimpSize} and (11) by inductive hypothesis.

Combining with the case when $s = []$, one gets
\begin{center}
	\begin{tabular}{lcll}
		$\rsize{\sum rs \backslash_r s}$ & $\leq$ & $max \; \rsize{\sum rs} \; 1+N*(length \; rs)$ 
						 & (12)\\
	\end{tabular}
\end{center}
(4), (8), and (12) are all the inductive cases proven.
\end{proof}


\begin{corollary}
	For any regex $a$, $\exists N_r. \forall s. \; \rsize{\bderssimp{a}{s}} \leq N_r$
\end{corollary}
\begin{proof}
	By \ref{sizeRelations}.
\end{proof}
\noindent





%-----------------------------------
%	SECTION 2
%-----------------------------------


%----------------------------------------------------------------------------------------
%	SECTION 3
%----------------------------------------------------------------------------------------


\subsection{A Closed Form for the Sequence Regular Expression}
\noindent

Before we get to the proof that says the intermediate result of our lexer will
remain finitely bounded, which is an important efficiency/liveness guarantee,
we shall first develop a few preparatory properties and definitions to 
make the process of proving that a breeze.

We define rewriting relations for $\rrexp$s, which allows us to do the 
same trick as we did for the correctness proof,
but this time we will have stronger equalities established.



What guarantee does this bound give us?

Whatever the regex is, it will not grow indefinitely.
Take our previous example $(a + aa)^*$ as an example:
\begin{center}
	\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
		\begin{tikzpicture}
			\begin{axis}[
				xlabel={number of $a$'s},
				x label style={at={(1.05,-0.05)}},
				ylabel={regex size},
				enlargelimits=false,
				xtick={0,5,...,30},
				xmax=33,
				ymax= 40,
				ytick={0,10,...,40},
				scaled ticks=false,
				axis lines=left,
				width=5cm,
				height=4cm, 
				legend entries={$(a + aa)^*$},  
				legend pos=north west,
				legend cell align=left]
				\addplot[red,mark=*, mark options={fill=white}] table {a_aa_star.data};
			\end{axis}
		\end{tikzpicture}
	\end{tabular}
\end{center}
We are able to limit the size of the regex $(a + aa)^*$'s derivatives
with our simplification
rules very effectively.


In our proof for the inductive case $r_1 \cdot r_2$, the dominant term in the bound
is $l_{N_2} * N_2$, where $N_2$ is the bound we have for $\llbracket \bderssimp{r_2}{s} \rrbracket$.
Given that $l_{N_2}$ is roughly the size $4^{N_2}$, the size bound $\llbracket \bderssimp{r_1 \cdot r_2}{s} \rrbracket$
inflates the size bound of $\llbracket \bderssimp{r_2}{s} \rrbracket$ with the function
$f(x) = x * 2^x$.
This means the bound we have will surge up at least
tower-exponentially with a linear increase of the depth.
For a regex of depth $n$, the bound
would be approximately $4^n$.

Test data in the graphs from randomly generated regular expressions
shows that the giant bounds are far from being hit.
%a few sample regular experessions' derivatives
%size change
%TODO: giving regex1_size_change.data showing a few regular expressions' size changes 
%w;r;t the input characters number, where the size is usually cubic in terms of original size
%a*, aa*, aaa*, .....
%randomly generated regular expressions
\begin{center}
	\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
		\begin{tikzpicture}
			\begin{axis}[
				xlabel={number of $a$'s},
				x label style={at={(1.05,-0.05)}},
				ylabel={regex size},
				enlargelimits=false,
				xtick={0,5,...,30},
				xmax=33,
				ymax=1000,
				ytick={0,100,...,1000},
				scaled ticks=false,
				axis lines=left,
				width=5cm,
				height=4cm, 
				legend entries={regex1},  
				legend pos=north west,
				legend cell align=left]
				\addplot[red,mark=*, mark options={fill=white}] table {regex1_size_change.data};
			\end{axis}
		\end{tikzpicture}
  &
  \begin{tikzpicture}
	  \begin{axis}[
		  xlabel={$n$},
		  x label style={at={(1.05,-0.05)}},
		  %ylabel={time in secs},
		  enlargelimits=false,
		  xtick={0,5,...,30},
		  xmax=33,
		  ymax=1000,
		  ytick={0,100,...,1000},
		  scaled ticks=false,
		  axis lines=left,
		  width=5cm,
		  height=4cm, 
		  legend entries={regex2},  
		  legend pos=north west,
		  legend cell align=left]
		  \addplot[blue,mark=*, mark options={fill=white}] table {regex2_size_change.data};
	  \end{axis}
  \end{tikzpicture}
  &
  \begin{tikzpicture}
	  \begin{axis}[
		  xlabel={$n$},
		  x label style={at={(1.05,-0.05)}},
		  %ylabel={time in secs},
		  enlargelimits=false,
		  xtick={0,5,...,30},
		  xmax=33,
		  ymax=1000,
		  ytick={0,100,...,1000},
		  scaled ticks=false,
		  axis lines=left,
		  width=5cm,
		  height=4cm, 
		  legend entries={regex3},  
		  legend pos=north west,
		  legend cell align=left]
		  \addplot[cyan,mark=*, mark options={fill=white}] table {regex3_size_change.data};
	  \end{axis}
  \end{tikzpicture}\\
  \multicolumn{3}{c}{Graphs: size change of 3 randomly generated regular expressions $w.r.t.$ input string length.}
	\end{tabular}    
\end{center}  
\noindent
Most of the regex's sizes seem to stay within a polynomial bound $w.r.t$ the 
original size.
We will discuss improvements to this bound in the next chapter.



%----------------------------------------------------------------------------------------
%	SECTION ??
%----------------------------------------------------------------------------------------

%-----------------------------------
%	SECTION syntactic equivalence under simp
%-----------------------------------
\section{Syntactic Equivalence Under $\simp$}
We prove that minor differences can be annhilated
by $\simp$.
For example,
\begin{center}
	$\simp \;(\simpALTs\; (\map \;(\_\backslash \; x)\; (\distinct \; \mathit{rs}\; \phi))) = 
	\simp \;(\simpALTs \;(\distinct \;(\map \;(\_ \backslash\; x) \; \mathit{rs}) \; \phi))$
\end{center}


%----------------------------------------------------------------------------------------
%	SECTION ALTS CLOSED FORM
%----------------------------------------------------------------------------------------
%\section{A Closed Form for \textit{ALTS}}
%Now we prove that  $rsimp (rders\_simp (RALTS rs) s) = rsimp (RALTS (map (\lambda r. rders\_simp r s) rs))$.
%
%
%There are a few key steps, one of these steps is
%
%
%
%One might want to prove this by something a simple statement like: 
%
%For this to hold we want the $\textit{distinct}$ function to pick up
%the elements before and after derivatives correctly:
%$r \in rset \equiv (rder x r) \in (rder x rset)$.
%which essentially requires that the function $\backslash$ is an injective mapping.
%
%Unfortunately the function $\backslash c$ is not an injective mapping.
%
%\subsection{function $\backslash c$ is not injective (1-to-1)}
%\begin{center}
%	The derivative $w.r.t$ character $c$ is not one-to-one.
%	Formally,
%	$\exists r_1 \;r_2. r_1 \neq r_2 \mathit{and} r_1 \backslash c = r_2 \backslash c$
%\end{center}
%This property is trivially true for the
%character regex example:
%\begin{center}
%	$r_1 = e; \; r_2 = d;\; r_1 \backslash c = \ZERO = r_2 \backslash c$
%\end{center}
%But apart from the cases where the derivative
%output is $\ZERO$, are there non-trivial results
%of derivatives which contain strings?
%The answer is yes.
%For example,
%\begin{center}
%	Let $r_1 = a^*b\;\quad r_2 = (a\cdot a^*)\cdot b + b$.\\
%	where $a$ is not nullable.\\
%	$r_1 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$\\
%	$r_2 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$
%\end{center}
%We start with two syntactically different regular expressions,
%and end up with the same derivative result.
%This is not surprising as we have such 
%equality as below in the style of Arden's lemma:\\
%\begin{center}
%	$L(A^*B) = L(A\cdot A^* \cdot B + B)$
%\end{center}

\section{Further Improvements to the Bound}
There are two problems with this finiteness result, though.
\begin{itemize}
	\item
		First, It is not yet a direct formalisation of our lexer's complexity,
		as a complexity proof would require looking into 
		the time it takes to execute {\bf all} the operations
		involved in the lexer (simp, collect, decode), not just the derivative.
	\item
		Second, the bound is not yet tight, and we seek to improve $N_a$ so that
		it is polynomial on $\llbracket a \rrbracket$.
\end{itemize}
Still, we believe this contribution is fruitful,
because
\begin{itemize}
	\item

		The size proof can serve as a cornerstone for a complexity
		formalisation.
		Derivatives are the most important phases of our lexer algorithm.
		Size properties about derivatives covers the majority of the algorithm
		and is therefore a good indication of complexity of the entire program.
	\item
		The bound is already a strong indication that catastrophic
		backtracking is much less likely to occur in our $\blexersimp$
		algorithm.
		We refine $\blexersimp$ with $\blexerStrong$ in the next chapter
		so that the bound becomes polynomial.
\end{itemize}

%----------------------------------------------------------------------------------------
%	SECTION 4
%----------------------------------------------------------------------------------------








One might wonder the actual bound rather than the loose bound we gave
for the convenience of an easier proof.
How much can the regex $r^* \backslash s$ grow? 
As  earlier graphs have shown,
%TODO: reference that graph where size grows quickly
they can grow at a maximum speed
exponential $w.r.t$ the number of characters, 
but will eventually level off when the string $s$ is long enough.
If they grow to a size exponential $w.r.t$ the original regex, our algorithm
would still be slow.
And unfortunately, we have concrete examples
where such regular expressions grew exponentially large before levelling off:
$(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots + 
(\underbrace{a \ldots a}_{\text{n a's}})^*$ will already have a maximum
size that is  exponential on the number $n$ 
under our current simplification rules:
%TODO: graph of a regex whose size increases exponentially.
\begin{center}
	\begin{tikzpicture}
		\begin{axis}[
			height=0.5\textwidth,
			width=\textwidth,
			xlabel=number of a's,
			xtick={0,...,9},
			ylabel=maximum size,
			ymode=log,
			log basis y={2}
			]
			\addplot[mark=*,blue] table {re-chengsong.data};
		\end{axis}
	\end{tikzpicture}
\end{center}

For convenience we use $(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$
to express $(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots + 
(\underbrace{a \ldots a}_{\text{n a's}})^*$ in the below discussion.
The exponential size is triggered by that the regex
$\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$
inside the $(\ldots) ^*$ having exponentially many
different derivatives, despite those difference being minor.
$(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*\backslash \underbrace{a \ldots a}_{\text{m a's}}$
will therefore contain the following terms (after flattening out all nested 
alternatives):
\begin{center}
	$(\oplus_{i = 1]{n}  (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot  (\underbrace{a \ldots a}_{\text{i a's}})^* })\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)$\\
	$(1 \leq m' \leq m )$
\end{center}
These terms are distinct for $m' \leq L.C.M.(1, \ldots, n)$ (will be explained in appendix).
With each new input character taking the derivative against the intermediate result, more and more such distinct
terms will accumulate, 
until the length reaches $L.C.M.(1, \ldots, n)$.
$\textit{distinctBy}$ will not be able to de-duplicate any two of these terms 
$(\oplus_{i = 1}^{n}  (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot  (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\

$(\oplus_{i = 1}^{n}  (\underbrace{a \ldots a}_{\text{((i - (m'' \% i))\%i) a's}})\cdot  (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\
where $m' \neq m''$ \\
as they are slightly different.
This means that with our current simplification methods,
we will not be able to control the derivative so that
$\llbracket \bderssimp{r}{s} \rrbracket$ stays polynomial %\leq O((\llbracket r\rrbacket)^c)$
as there are already exponentially many terms.
These terms are similar in the sense that the head of those terms
are all consisted of sub-terms of the form: 
$(\underbrace{a \ldots a}_{\text{j a's}})\cdot  (\underbrace{a \ldots a}_{\text{i a's}})^* $.
For  $\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$, there will be at most
$n * (n + 1) / 2$ such terms. 
For example, $(a^* + (aa)^* + (aaa)^*) ^*$'s derivatives
can be described by 6 terms:
$a^*$, $a\cdot (aa)^*$, $ (aa)^*$, 
$aa \cdot (aaa)^*$, $a \cdot (aaa)^*$, and $(aaa)^*$.
The total number of different "head terms",  $n * (n + 1) / 2$,
is proportional to the number of characters in the regex 
$(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$.
This suggests a slightly different notion of size, which we call the 
alphabetic width:
%TODO:
(TODO: Alphabetic width def.)


Antimirov\parencite{Antimirov95} has proven that 
$\textit{PDER}_{UNIV}(r) \leq \textit{awidth}(r)$.
where $\textit{PDER}_{UNIV}(r)$ is a set of all possible subterms
created by doing derivatives of $r$ against all possible strings.
If we can make sure that at any moment in our lexing algorithm our 
intermediate result hold at most one copy of each of the 
subterms then we can get the same bound as Antimirov's.
This leads to the algorithm in the next chapter.





%----------------------------------------------------------------------------------------
%	SECTION 1
%----------------------------------------------------------------------------------------


%-----------------------------------
%	SUBSECTION 1
%-----------------------------------
\subsection{Syntactic Equivalence Under $\simp$}
We prove that minor differences can be annhilated
by $\simp$.
For example,
\begin{center}
	$\simp \;(\simpALTs\; (\map \;(\_\backslash \; x)\; (\distinct \; \mathit{rs}\; \phi))) = 
	\simp \;(\simpALTs \;(\distinct \;(\map \;(\_ \backslash\; x) \; \mathit{rs}) \; \phi))$
\end{center}
author	Chengsong
	Mon, 03 Oct 2022 02:08:49 +0100
changeset 609	61139fdddae0
parent 601	ce4e5151a836
child 610	d028c662a3df
permissions	-rwxr-xr-x