ChengsongTanPhdThesis/Chapters/Finite.tex
author Chengsong
Mon, 07 Nov 2022 21:31:07 +0000
changeset 620 ae6010c14e49
parent 618 233cf2b97d1a
child 624 8ffa28fce271
permissions -rwxr-xr-x
chap6 almost done

% Chapter Template

\chapter{Finiteness Bound} % Main chapter title

\label{Finite} 
%  In Chapter 4 \ref{Chapter4} we give the second guarantee
%of our bitcoded algorithm, that is a finite bound on the size of any 
%regex's derivatives. 


In this chapter we give a guarantee in terms of size of derivatives: 
given an annotated regular expression $a$, for any string $s$
our algorithm $\blexersimp$'s internal annotated regular expression 
size  is finitely bounded
by a constant  that only depends on $a$.
Formally we show that there exists an $N_a$ such that
\begin{center}
	$\llbracket \bderssimp{a}{s} \rrbracket \leq N_a$
\end{center}
\noindent
where the size ($\llbracket \_ \rrbracket$) of 
an annotated regular expression is defined
in terms of the number of nodes in its 
tree structure (its recursive definition given in the next page).
We believe this size bound
is important in the context of POSIX lexing, because 
\begin{itemize}
	\item
		It is a stepping stone towards the goal 
		of eliminating ``catastrophic backtracking''. 
		If the internal data structures used by our algorithm
		grows beyond a finite bound, then clearly 
		the algorithm (which traverses these structures) will
		be slow.
		The next step would be to refine the bound $N_a$ so that it
		is polynomial on $\llbracket a\rrbracket$.
	\item
		Having the finite bound formalised 
		gives us a higher confidence that
		our simplification algorithm $\simp$ does not ``mis-behave''
		like $\simpsulz$ does.
		The bound is universal for a given regular expression, 
		which is an advantage over work which 
		only gives empirical evidence on some test cases (see Verbatim++\cite{Verbatimpp}).
\end{itemize}
In the next section we describe in more detail
what the finite bound means in our algorithm
and why the size of the internal data structures of
a typical derivative-based lexer such as
Sulzmann and Lu's need formal treatment.

\section{Formalising About Size}
\noindent
In our lexer ($\blexersimp$),
we take an annotated regular expression as input,
and repeately take derivative of and simplify it.
\begin{figure}
	\begin{center}
		\begin{tabular}{lcl}
			$\llbracket _{bs}\ONE \rrbracket$ & $\dn$ & $1$\\
			$\llbracket \ZERO \rrbracket$ & $\dn$ & $1$ \\
			$\llbracket _{bs} r_1 \cdot r_2 \rrbracket$ & $\dn$ & $\llbracket r_1 \rrbracket + \llbracket r_2 \rrbracket + 1$\\
			$\llbracket _{bs}\mathbf{c} \rrbracket $ & $\dn$ & $1$\\
			$\llbracket _{bs}\sum as \rrbracket $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket)\; as   + 1$\\
			$\llbracket _{bs} a^* \rrbracket $ & $\dn$ & $\llbracket a \rrbracket + 1$.
		\end{tabular}
	\end{center}
	\caption{The size function of bitcoded regular expressions}\label{brexpSize}
\end{figure}

\begin{figure}
	\begin{tikzpicture}[scale=2,
		every node/.style={minimum size=11mm},
		->,>=stealth',shorten >=1pt,auto,thick
		]
		\node (r0) [rectangle, draw=black, thick, minimum size = 5mm, draw=blue] {$a$};
		\node (r1) [rectangle, draw=black, thick, right=of r0, minimum size = 7mm]{$a_1$};
		\draw[->,line width=0.2mm](r0)--(r1) node[above,midway] {$\backslash c_1$};

		\node (r1s) [rectangle, draw=blue, thick, right=of r1, minimum size=6mm]{$a_{1s}$};
		\draw[->, line width=0.2mm](r1)--(r1s) node[above, midway] {$\simp$};

		\node (r2) [rectangle, draw=black, thick,  right=of r1s, minimum size = 12mm]{$a_2$};
		\draw[->,line width=0.2mm](r1s)--(r2) node[above,midway] {$\backslash c_2$};

		\node (r2s) [rectangle, draw = blue, thick, right=of r2,minimum size=6mm]{$a_{2s}$};
		\draw[->,line width=0.2mm](r2)--(r2s) node[above,midway] {$\simp$};

		\node (rns) [rectangle, draw = blue, thick, right=of r2s,minimum size=6mm]{$a_{ns}$};
		\draw[->,line width=0.2mm, dashed](r2s)--(rns) node[above,midway] {$\backslash \ldots$};

		\node (v) [circle, thick, draw, right=of rns, minimum size=6mm, right=1.7cm]{$v$};
		\draw[->, line width=0.2mm](rns)--(v) node[above, midway] {\bmkeps} node [below, midway] {\decode};
	\end{tikzpicture}
	\caption{Regular expression size change during our $\blexersimp$ algorithm}\label{simpShrinks}
\end{figure}

\noindent
Each time
a derivative is taken, the regular expression might grow.
However, the simplification that is immediately afterwards will always shrink it so that 
it stays relatively small.
This intuition is depicted by the relative size
change between the black and blue nodes:
After $\simp$ the node always shrinks.
Our proof states that all the blue nodes
stay below a size bound $N_a$ determined by the input $a$.

\noindent
Sulzmann and Lu's assumed a similar picture about their algorithm,
though in fact their algorithm's size might be better depicted by the following graph:
\begin{figure}[H]
	\begin{tikzpicture}[scale=2,
		every node/.style={minimum size=11mm},
		->,>=stealth',shorten >=1pt,auto,thick
		]
		\node (r0) [rectangle, draw=black, thick, minimum size = 5mm, draw=blue] {$a$};
		\node (r1) [rectangle, draw=black, thick, right=of r0, minimum size = 7mm]{$a_1$};
		\draw[->,line width=0.2mm](r0)--(r1) node[above,midway] {$\backslash c_1$};

		\node (r1s) [rectangle, draw=blue, thick, right=of r1, minimum size=7mm]{$a_{1s}$};
		\draw[->, line width=0.2mm](r1)--(r1s) node[above, midway] {$\simp'$};

		\node (r2) [rectangle, draw=black, thick,  right=of r1s, minimum size = 17mm]{$a_2$};
		\draw[->,line width=0.2mm](r1s)--(r2) node[above,midway] {$\backslash c_2$};

		\node (r2s) [rectangle, draw = blue, thick, right=of r2,minimum size=14mm]{$a_{2s}$};
		\draw[->,line width=0.2mm](r2)--(r2s) node[above,midway] {$\simp'$};

		\node (r3) [rectangle, draw = black, thick, right= of r2s, minimum size = 22mm]{$a_3$};
		\draw[->,line width=0.2mm](r2s)--(r3) node[above,midway] {$\backslash c_3$};

		\node (rns) [right = of r3, draw=blue, minimum size = 20mm]{$a_{3s}$};
		\draw[->,line width=0.2mm] (r3)--(rns) node [above, midway] {$\simp'$};

		\node (rnn) [right = of rns, minimum size = 1mm]{};
		\draw[->, dashed] (rns)--(rnn) node [above, midway] {$\ldots$};

	\end{tikzpicture}
	\caption{Regular expression size change during our $\blexersimp$ algorithm}\label{sulzShrinks}
\end{figure}
\noindent
The picture means that on certain cases their lexer (where they use $\simpsulz$ 
as the simplification function)
will have a size explosion, causing the running time 
of each derivative step to grow continuously (for example 
in \ref{SulzmannLuLexerTime}).
They tested out the run time of their
lexer on particular examples such as $(a+b+ab)^*$
and claimed that their algorithm is linear w.r.t to the input.
With our mecahnised proof, we avoid this type of unintentional
generalisation.\\

Before delving in to the details of the formalisation,
we are going to provide an overview of it.
In the next subsection, we give a high-level view
of the proof.


\subsection{Overview of the Proof}
A high-level overview of the main components of the finiteness proof
is as follows:
\begin{figure}[H]
	\begin{tikzpicture}[scale=1,font=\bf,
		node/.style={
			rectangle,rounded corners=3mm,
			ultra thick,draw=black!50,minimum height=18mm, 
			minimum width=20mm,
		top color=white,bottom color=black!20}]


		\node (0) at (-5,0) 
			[node, text width=1.8cm, text centered] 
			{$\llbracket \bderssimp{a}{s} \rrbracket$};
		\node (A) at (0,0) 
			[node,text width=1.6cm,  text centered] 
			{$\llbracket \rderssimp{r}{s} \rrbracket_r$};
		\node (B) at (3,0) 
			[node,text width=3.0cm, anchor=west, minimum width = 40mm] 
			{$\llbracket \textit{ClosedForm}(r, s)\rrbracket_r$};
		\node (C) at (9.5,0) [node, minimum width=10mm] {$N_r$};

		\draw [->,line width=0.5mm] (0) -- 
			node [above,pos=0.45] {=} (A) node [below, pos = 0.45] {$(r = a \downarrow_r)$} (A); 
		\draw [->,line width=0.5mm] (A) -- 
			node [above,pos=0.35] {$\quad =\ldots=$} (B); 
		\draw [->,line width=0.5mm] (B) -- 
			node [above,pos=0.35] {$\quad \leq \ldots \leq$} (C); 
	\end{tikzpicture}
	%\caption{
\end{figure}
\noindent
We explain the steps one by one:
\begin{itemize}
	\item
		We first introduce the operations such as 
		derivatives, simplification, size calculation, etc.
		associated with $\rrexp$s, which we have introduced
		in chapter \ref{Bitcoded2}.
		The operations on $\rrexp$s are identical to those on
		annotated regular expressions except that they dispense with
		bitcodes. This means that all proofs about size of $\rrexp$s will apply to
		annotated regular expressions, because the size of a regular
		expression is independent of the bitcodes.
	\item
		We prove that $\rderssimp{r}{s} = \textit{ClosedForm}(r, s)$,
		where $\textit{ClosedForm}(r, s)$ is entirely 
		given as the derivatives of their children regular 
		expressions.
		We call the right-hand-side the \emph{Closed Form}
		of the derivative $\rderssimp{r}{s}$.
	\item
		Formally we give an estimate of 
		$\llbracket \textit{ClosedForm}(r, s) \rrbracket_r$.
		The key observation is that $\distinctBy$'s output is
		a list with a constant length bound.
\end{itemize}
We will expand on these steps in the next sections.\\

\section{The $\textit{Rrexp}$ Datatype}
The first step is to define 
$\textit{rrexp}$s.
They are annotated regular expressions without bitcodes,
allowing a much simpler size bound proof.
%Of course, the bits which encode the lexing information 
%would grow linearly with respect 
%to the input, which should be taken into account when we wish to tackle the runtime comlexity.
%But for the sake of the structural size 
%we can safely ignore them.\\
The datatype 
definition of the $\rrexp$, called
\emph{r-regular expressions},
was initially defined in \ref{rrexpDef}.
The reason for the prefix $r$ is
to make a distinction  
with basic regular expressions.
\[			\rrexp ::=   \RZERO \mid  \RONE
	\mid  \RCHAR{c}  
	\mid  \RSEQ{r_1}{r_2}
	\mid  \RALTS{rs}
	\mid \RSTAR{r}        
\]
The size of an r-regular expression is
written $\llbracket r\rrbracket_r$, 
whose definition mirrors that of an annotated regular expression. 
\begin{center}
	\begin{tabular}{lcl}
		$\llbracket _{bs}\ONE \rrbracket_r$ & $\dn$ & $1$\\
		$\llbracket \ZERO \rrbracket_r$ & $\dn$ & $1$ \\
		$\llbracket _{bs} r_1 \cdot r_2 \rrbracket_r$ & $\dn$ & $\llbracket r_1 \rrbracket_r + \llbracket r_2 \rrbracket_r + 1$\\
		$\llbracket _{bs}\mathbf{c} \rrbracket_r $ & $\dn$ & $1$\\
		$\llbracket _{bs}\sum as \rrbracket_r $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket_r)\; as   + 1$\\
		$\llbracket _{bs} a^* \rrbracket_r $ & $\dn$ & $\llbracket a \rrbracket_r + 1$.
	\end{tabular}
\end{center}
\noindent
The $r$ in the subscript of $\llbracket \rrbracket_r$ is to 
differentiate with the same operation for annotated regular expressions.
Adding $r$ as subscript will be used in 
other operations as well.\\
The transformation from an annotated regular expression
to an r-regular expression is straightforward.
\begin{center}
	\begin{tabular}{lcl}
		$\rerase{\ZERO}$ & $\dn$ & $\RZERO$\\
		$\rerase{_{bs}\ONE}$ & $\dn$ & $\RONE$\\
		$\rerase{_{bs}\mathbf{c}}$ & $\dn$ & $\RCHAR{c}$\\
		$\rerase{_{bs}r_1\cdot r_2}$ & $\dn$ & $\RSEQ{\rerase{r_1}}{\rerase{r_2}}$\\
		$\rerase{_{bs}\sum as}$ & $\dn$ & $\RALTS{\map \; \rerase{\_} \; as}$\\
		$\rerase{_{bs} a ^*}$ & $\dn$ & $\rerase{a} ^*$
	\end{tabular}
\end{center}

\subsection{Why a New Datatype?}
The reason we
define a new datatype is that 
the $\erase$ function 
does not preserve the structure of annotated
regular expressions.
We initially started by using 
plain regular expressions and tried to prove
lemma \ref{rsizeAsize},
however the $\erase$ function messes with the structure of the 
annotated regular expression.
The $+$ constructor
of basic regular expressions is only binary, whereas $\sum$ 
takes a list. Therefore we need to convert between
annotated and normal regular expressions as follows:
\begin{center}
	\begin{tabular}{lcl}
		$\erase \; _{bs}\sum [] $ & $\dn$ & $\ZERO$\\
		$\erase \; _{bs}\sum [a]$ & $\dn$ & $a$\\
		$\erase \; _{bs}\sum a :: as$ & $\dn$ & $a + (\erase \; _{[]} \sum as)\quad \text{if $as$ length over 1}$
	\end{tabular}
\end{center}
\noindent
As can be seen alternative regular expression with an empty argument list
will be turned into a $\ZERO$.
The singleton alternative $\sum [r]$ becomes $r$ during the
$\erase$ function.
The  annotated regular expression $\sum[a, b, c]$ would turn into
$(a+(b+c))$.
All these operations change the size and structure of
an annotated regular expressions, adding unnecessary 
complexities to the size bound proof.\\
For example, if we define the size of a basic plain regular expression 
in the usual way,
\begin{center}
	\begin{tabular}{lcl}
		$\llbracket \ONE \rrbracket_p$ & $\dn$ & $1$\\
		$\llbracket \ZERO \rrbracket_p$ & $\dn$ & $1$ \\
		$\llbracket r_1 \cdot r_2 \rrbracket_p$ & $\dn$ & $\llbracket r_1 \rrbracket_p + \llbracket r_2 \rrbracket_p + 1$\\
		$\llbracket \mathbf{c} \rrbracket_p $ & $\dn$ & $1$\\
		$\llbracket r_1 \cdot r_2 \rrbracket_p $ & $\dn$ & $\llbracket r_1 \rrbracket_p \; + \llbracket r_2 \rrbracket_p + 1$\\
		$\llbracket a^* \rrbracket_p $ & $\dn$ & $\llbracket a \rrbracket_p + 1$
	\end{tabular}
\end{center}
\noindent
Then the property
\begin{center}
	$\llbracket a \rrbracket \stackrel{?}{=} \llbracket a_\downarrow \rrbracket_p$
\end{center}
does not hold.
With $\textit{rerase}$, however, 
only the bitcodes are thrown away.
Everything about the structure remains intact.
Therefore it does not change the size
of an annotated regular expression and we have:
\begin{lemma}\label{rsizeAsize}
	$\rsize{\rerase a} = \asize a$
\end{lemma}
\begin{proof}
	By routine structural induction on $a$.
\end{proof}
\noindent
One might be able to prove an inequality such as
$\llbracket a \rrbracket  \leq \llbracket  a_\downarrow \rrbracket_p $
and then estimate $\llbracket  a_\downarrow \rrbracket_p$,
but we found our approach more straightforward.\\

\subsection{Functions for R-regular Expressions}
In this section we shall define the r-regular expression version
of $\blexer$, and $\blexersimp$ related functions.
We use $r$ as the prefix or subscript to differentiate
with the bitcoded version.
%For example,$\backslash_r$, $\rdistincts$, and $\rsimp$
%as opposed to $\backslash$, $\distinctBy$, and $\bsimp$.
%As promised, they are much simpler than their bitcoded counterparts.
%The operations on r-regular expressions are 
%almost identical to those of the annotated regular expressions,
%except that no bitcodes are used. For example,
The derivative operation for an r-regular expression is\\
\begin{center}
	\begin{tabular}{@{}lcl@{}}
		$(\ZERO)\,\backslash_r c$ & $\dn$ & $\ZERO$\\  
		$(\ONE)\,\backslash_r c$ & $\dn$ &
		$\textit{if}\;c=d\; \;\textit{then}\;
		\ONE\;\textit{else}\;\ZERO$\\  
		$(\sum \;\textit{rs})\,\backslash_r c$ & $\dn$ &
		$\sum\;(\textit{map} \; (\_\backslash_r c) \; rs )$\\
		$(r_1\cdot r_2)\,\backslash_r c$ & $\dn$ &
		$\textit{if}\;(\textit{rnullable}\,r_1)$\\
						 & &$\textit{then}\;\sum\,[(r_1\,\backslash_r c)\cdot\,r_2,$\\
						 & &$\phantom{\textit{then},\;\sum\,}((r_2\,\backslash_r c))]$\\
						 & &$\textit{else}\;\,(r_1\,\backslash_r c)\cdot r_2$\\
		$(r^*)\,\backslash_r c$ & $\dn$ &
		$( r\,\backslash_r c)\cdot
		(_{[]}r^*))$
	\end{tabular}    
\end{center}  
\noindent
where we omit the definition of $\textit{rnullable}$.
The generalisation from derivative w.r.t to character to
derivative w.r.t string is given as
\begin{center}
	\begin{tabular}{lcl}
		$r \backslash_{rs} []$ & $\dn$ & $r$\\
		$r \backslash_{rs} c::s$ & $\dn$ & $(r\backslash_r c) \backslash_{rs} s$
	\end{tabular}
\end{center}

The function $\distinctBy$ for r-regular expressions does not need 
a function checking equivalence because
there are no bit annotations.
Therefore we have
\begin{center}
	\begin{tabular}{lcl}
		$\rdistinct{[]}{rset} $ & $\dn$ & $[]$\\
		$\rdistinct{r :: rs}{rset}$ & $\dn$ & 
		$\textit{if}(r \in \textit{rset}) \; \textit{then} \; \rdistinct{rs}{rset}$\\
					    &        & $\textit{else}\; \;
					    r::\rdistinct{rs}{(rset \cup \{r\})}$
	\end{tabular}
\end{center}
%TODO: definition of rsimp (maybe only the alternative clause)
\noindent
%We would like to make clear
%a difference between our $\rdistincts$ and
%the Isabelle $\textit {distinct}$ predicate.
%In Isabelle $\textit{distinct}$ is a function that returns a boolean
%rather than a list.
%It tests if all the elements of a list are unique.\\
With $\textit{rdistinct}$ in place,
the flatten function for $\rrexp$ is as follows:
 \begin{center}
  \begin{tabular}{@{}lcl@{}}
  $\textit{rflts} \; (\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $as \; @ \; \textit{rflts} \; as' $ \\
  $\textit{rflts} \; \ZERO :: as'$ & $\dn$ & $ \textit{rflts} \;  \textit{as'} $ \\
    $\textit{rflts} \; a :: as'$ & $\dn$ & $a :: \textit{rflts} \; \textit{as'}$ \quad(otherwise) 
\end{tabular}    
\end{center}  
\noindent
The function 
$\rsimpalts$ corresponds to $\textit{bsimp}_{ALTS}$:
\begin{center}
  \begin{tabular}{@{}lcl@{}}
	  $\rsimpalts \;\; nil$ & $\dn$ & $\RZERO$\\
	  $\rsimpalts \;\; r::nil$ & $\dn$ & $r$\\
	  $\rsimpalts \;\; rs$ & $\dn$ & $\sum rs$\\
\end{tabular}    
\end{center}  
\noindent
Similarly, we have $\rsimpseq$ which corresponds to $\textit{bsimp}_{SEQ}$:
\begin{center}
  \begin{tabular}{@{}lcl@{}}
	  $\rsimpseq \;\; \RZERO \; \_ $ &   $=$ &   $\RZERO$\\
	  $\rsimpseq \;\; \_ \; \RZERO $ &   $=$ &   $\RZERO$\\
	  $\rsimpseq \;\; \RONE \cdot r_2$ & $\dn$ & $r_2$\\
	  $\rsimpseq \;\; r_1 r_2$ & $\dn$ & $r_1 \cdot r_2$\\
\end{tabular}    
\end{center}  
and get $\textit{rsimp}$ and $\rderssimp{\_}{\_}$:
\begin{center}
  \begin{tabular}{@{}lcl@{}}
   
	  $\textit{rsimp} \; (r_1\cdot r_2)$ & $\dn$ & $ \textit{rsimp}_{SEQ} \; bs \;(\textit{rsimp} \; r_1) \; (\textit{rsimp}  \; r_2)  $ \\
	  $\textit{rsimp} \; (_{bs}\sum \textit{rs})$ & $\dn$ & $\textit{rsimp}_{ALTS} \; \textit{bs} \; (\textit{rdistinct} \; ( \textit{rflts} ( \textit{map} \; rsimp \; rs)) \; \rerases \; \varnothing) $ \\
   $\textit{rsimp} \; r$ & $\dn$ & $\textit{r} \qquad \textit{otherwise}$   
\end{tabular}    
\end{center} 
\begin{center}
	\begin{tabular}{@{}lcl@{}}
		$r\backslash_{rsimp} \, c$ & $\dn$ & $\rsimp \; (r\backslash_r \, c)$
	\end{tabular}
\end{center}

\begin{center}
	\begin{tabular}{@{}lcl@{}}
$r \backslash_{rsimps} \; \; c\!::\!s $ & $\dn$ & $(r \backslash_{rsimp}\, c) \backslash_{rsimps}\, s$ \\
$r \backslash_{rsimps} [\,] $ & $\dn$ & $r$
	\end{tabular}
\end{center}
\noindent
We do not define an r-regular expression version of $\blexersimp$,
as our proof does not depend on it.
Now we are ready to introduce how r-regular expressions allow
us to prove the size bound on bitcoded regular expressions.

\subsection{Using R-regular Expressions to Bound Bit-coded Regular Expressions}
Everything about the size of annotated regular expressions after the application
of function $\bsimp$ and $\backslash_{simps}$
can be calculated via the size of r-regular expressions after the application
of $\rsimp$ and $\backslash_{rsimps}$:
\begin{lemma}\label{sizeRelations}
	The following two equalities hold:
	\begin{itemize}
		\item
			$\asize{\bsimps \; a} = \rsize{\rsimp{ \rerase{a}}}$
		\item
			$\asize{\bderssimp{a}{s}} =  \rsize{\rderssimp{\rerase{a}}{s}}$
	\end{itemize}
\end{lemma}
\begin{proof}
	The first part is by induction on the inductive cases
	of $\textit{bsimp}$.
	The second part is by induction on the string $s$,
	where the inductive step follows from part one.
\end{proof}
\noindent
With lemma \ref{sizeRelations},
we will be able to focus on 
estimating only
$\rsize{\rderssimp{\rerase{a}}{s}}$
in later parts because
\begin{center}
	$\rsize{\rderssimp{\rerase{a}}{s}} \leq N_r \quad$
	implies
	$\quad \llbracket a \backslash_{bsimps} s \rrbracket \leq N_r$.
\end{center}
%From now on we 
%Unless stated otherwise in the rest of this 
%chapter all regular expressions without
%bitcodes are seen as r-regular expressions ($\rrexp$s).
%For the binary alternative r-regular expression $\RALTS{[r_1, r_2]}$,
%we use the notation $r_1 + r_2$
%for brevity.


%-----------------------------------
%	SUB SECTION ROADMAP RREXP BOUND
%-----------------------------------

%\subsection{Roadmap to a Bound for $\textit{Rrexp}$}

%The way we obtain the bound for $\rrexp$s is by two steps:
%\begin{itemize}
%	\item
%		First, we rewrite $r\backslash s$ into something else that is easier
%		to bound. This step is especially important for the inductive case 
%		$r_1 \cdot r_2$ and $r^*$, where the derivative can grow and bloat in a wild way,
%		but after simplification they will always be equal or smaller to a form consisting of an alternative
%		list of regular expressions $f \; (g\; (\sum rs))$ with some functions applied to it, where each element will be distinct after the function application.
%	\item
%		Then, for such a sum  list of regular expressions $f\; (g\; (\sum rs))$, we can control its size
%		by estimation, since $\distinctBy$ and $\flts$ are well-behaved and working together would only 
%		reduce the size of a regular expression, not adding to it.
%\end{itemize}
%
%\section{Step One: Closed Forms}
%We transform the function application $\rderssimp{r}{s}$
%into an equivalent 
%form $f\; (g \; (\sum rs))$.
%The functions $f$ and $g$ can be anything from $\flts$, $\distinctBy$ and other helper functions from $\bsimp{\_}$.
%This way we get a different but equivalent way of expressing : $r\backslash s = f \; (g\; (\sum rs))$, we call the
%right hand side the "closed form" of $r\backslash s$.
%
%\begin{quote}\it
%	Claim: For regular expressions $r_1 \cdot r_2$, we claim that
%\end{quote}
%\noindent
%We explain in detail how we reached those claims.
If we attempt to prove 
\begin{center}
	$\forall r. \; \exists N_r.\;\; s.t. \llbracket r\backslash_{rsimps} s \rrbracket_r \leq N_r$
\end{center}
using a naive induction on the structure of $r$,
then we are stuck at the inductive cases such as
$r_1\cdot r_2$.
The inductive hypotheses are:
\begin{center}
	1: $\text{for } r_1, \text{there exists } N_{r_1}.\;\; s.t. 
	\;\;\forall s.  \llbracket r_1 \backslash_{rsimps} s \rrbracket_r \leq N_{r_1}. $\\
	2: $\text{for } r_2, \text{there exists } N_{r_2}.\;\; s.t. 
	\;\; \forall s. \llbracket r_2 \backslash_{rsimps} s \rrbracket_r \leq N_{r_2}. $
\end{center}
The inductive step to prove would be 
\begin{center}
	$\text{there exists } N_{r_1\cdot r_2}. \;\; s.t. \forall s. 
	\llbracket (r_1 \cdot r_2) \backslash_{rsimps} s \rrbracket_r \leq N_{r_1\cdot r_2}.$
\end{center}
The problem is that it is not clear what 
$(r_1\cdot r_2) \backslash_{rsimps} s$ looks like,
and therefore $N_{r_1}$ and $N_{r_2}$ in the
inductive hypotheses cannot be directly used.
%We have already seen that $(r_1 \cdot r_2)\backslash s$ 
%and $(r^*)\backslash s$ can grow in a wild way.

The point however, is that they will be equivalent to a list of
terms $\sum rs$, where each term in $rs$ will
be made of $r_1 \backslash s' $, $r_2\backslash s'$,
and $r \backslash s'$ with $s' \in \textit{SubString} \; s$ (which stands
for the set of substrings of $s$).
The list $\sum rs$ will then be de-duplicated by $\textit{rdistinct}$
in the simplification, which prevents the $rs$ from growing indefinitely.

Based on this idea, we develop a proof in two steps.
First, we show the equality (where
$f$ and $g$ are functions that do not increase the size of the input)
\begin{center}
$r\backslash_{rsimps} s = f\; (\textit{rdistinct} \; (g\; \sum rs))$,
\end{center}
where $r = r_1 \cdot r_2$ or $r = r_0^*$ and so on.
For example, for $r_1 \cdot r_2$ we have the equality as
	\begin{center}
		$ \rderssimp{r_1 \cdot r_2}{s} = 
		\rsimp{(\sum (r_1 \backslash s \cdot r_2 ) \; :: \;(\map \; \rderssimp{r2}{\_} \;(\vsuf{s}{r_1})))}$
	\end{center}
We call the right-hand-side the 
\emph{Closed Form} of $(r_1 \cdot r_2)\backslash_{rsimps} s$.
Second, we will bound the closed form of r-regular expressions
using some estimation techniques
and then apply
lemma \ref{sizeRelations} to show that the bitcoded regular expressions
in our $\blexersimp$ are finitely bounded.

We will describe in detail the first step of the proof
in the next section.

\section{Closed Forms}
In this section we introduce in detail
how we express the string derivatives
of regular expressions (i.e. $r \backslash_r s$ where $s$ is a string
rather than a single character) in a different way than 
our previous definition.
In previous chapters, the derivative of a 
regular expression $r$ w.r.t a string $s$
was recursively defined on the string:
\begin{center}
	$r \backslash_s (c::s) \dn (r \backslash c) \backslash_s s$
\end{center}
The problem is that 
this definition does not provide much information
on what $r \backslash_s s$ looks like.
If we are interested in the size of a derivative like 
$(r_1 \cdot r_2)\backslash s$,
we have to somehow get a more concrete form to begin.
We call such more concrete representations the ``closed forms'' of
string derivatives as opposed to their original definitions.
The terminology ``closed form'' is borrowed from mathematics,
which usually describe expressions that are solely comprised of
well-known and easy-to-compute operations such as 
additions, multiplications, exponential functions.

We start by proving some basic identities
involving the simplification functions for r-regular expressions.
After that we introduce the rewrite relations
$\rightsquigarrow_h$, $\rightsquigarrow^*_{scf}$
$\rightsquigarrow_f$ and $\rightsquigarrow_g$.
These relations involves similar techniques as in chapter \ref{Bitcoded2}
for annotated regular expressions.
Finally, we use these identities to establish the
closed forms of the alternative regular expression,
the sequence regular expression, and the star regular expression.
%$r_1\cdot r_2$, $r^*$ and $\sum rs$.



\subsection{Some Basic Identities}

In what follows we will often convert between lists
and sets.
We use Isabelle's $set$ to refere to the 
function that converts a list $rs$ to the set
containing all the elements in $rs$.
\subsubsection{$\textit{rdistinct}$'s Does the Job of De-duplication}
The $\textit{rdistinct}$ function, as its name suggests, will
de-duplicate an r-regular expression list.
It will also remove any elements that 
is already in the accumulator set.
\begin{lemma}\label{rdistinctDoesTheJob}
	%The function $\textit{rdistinct}$ satisfies the following
	%properties:
	Assume we have the predicate $\textit{isDistinct}$\footnote{We omit its
	recursive definition here. Its Isabelle counterpart would be $\textit{distinct}$.} 
	for testing
	whether a list's elements are unique. Then the following
	properties about $\textit{rdistinct}$ hold:
	\begin{itemize}
		\item
			If $a \in acc$ then $a \notin (\rdistinct{rs}{acc})$.
		\item
			%If list $rs'$ is the result of $\rdistinct{rs}{acc}$,
			$\textit{isDistinct} \;\;\; (\rdistinct{rs}{acc})$.
		\item
			$\textit{set} \; (\rdistinct{rs}{acc}) 
			= (textit{set} \; rs) - acc$
	\end{itemize}
\end{lemma}
\noindent
\begin{proof}
	The first part is by an induction on $rs$.
	The second and third part can be proven by using the 
	inductive cases of $\textit{rdistinct}$.

\end{proof}

\noindent
%$\textit{rdistinct}$ will out all regular expression terms
%that are in the accumulator, therefore 
Concatenating a list $rs_a$ at the front of another
list $rs$ whose elements are all from the accumulator, and then calling $\textit{rdistinct}$
on the merged list, the output will be as if we had called $\textit{rdistinct}$
without the prepending of $rs$:
\begin{lemma}\label{rdistinctConcat}
	The elements appearing in the accumulator will always be removed.
	More precisely,
	\begin{itemize}
		\item
			If $rs \subseteq rset$, then 
			$\rdistinct{rs@rsa }{acc} = \rdistinct{rsa }{acc}$.
		\item
			More generally, if $a \in rset$ and $\rdistinct{rs}{\{a\}} = []$,
			then $\rdistinct{(rs @ rs')}{rset} = \rdistinct{rs'}{rset}$
	\end{itemize}
\end{lemma}

\begin{proof}
	By induction on $rs$ and using \ref{rdistinctDoesTheJob}.
\end{proof}
\noindent
On the other hand, if an element $r$ does not appear in the input list waiting to be deduplicated,
then expanding the accumulator to include that element will not cause the output list to change:
\begin{lemma}\label{rdistinctOnDistinct}
	The accumulator can be augmented to include elements not appearing in the input list,
	and the output will not change.	
	\begin{itemize}
		\item
			If $r \notin rs$, then $\rdistinct{rs}{acc} = \rdistinct{rs}{(\{r\} \cup acc)}$.
		\item
			Particularly, if $\;\;\textit{isDistinct} \; rs$, then we have\\
			\[ \rdistinct{rs}{\varnothing} = rs \]
	\end{itemize}
\end{lemma}
\begin{proof}
	The first half is by induction on $rs$. The second half is a corollary of the first.
\end{proof}
\noindent
The function $\textit{rdistinct}$ removes duplicates from anywhere in a list.
Despite being seemingly obvious, 
the induction technique is not as straightforward.
\begin{lemma}\label{distinctRemovesMiddle}
	The two properties hold if $r \in rs$:
	\begin{itemize}
		\item
			$\rdistinct{rs}{rset} = \rdistinct{(rs @ [r])}{rset}$\\
			and\\
			$\rdistinct{(ab :: rs @ [ab])}{rset'} = \rdistinct{(ab :: rs)}{rset'}$
		\item
			$\rdistinct{ (rs @ rs') }{rset} = \rdistinct{rs @ [r] @ rs'}{rset}$\\
			and\\
			$\rdistinct{(ab :: rs @ [ab] @ rs'')}{rset'} = 
			\rdistinct{(ab :: rs @ rs'')}{rset'}$
	\end{itemize}
\end{lemma}
\noindent
\begin{proof}
	By induction on $rs$. All other variables are allowed to be arbitrary.
	The second part of the lemma requires the first.
	Note that for each part, the two sub-propositions need to be proven 
	at the same time,
	so that the induction goes through.
\end{proof}
\noindent
This allows us to prove a few more equivalence relations involving 
$\textit{rdistinct}$ (they will be useful later):
\begin{lemma}\label{rdistinctConcatGeneral}
	\mbox{}
	\begin{itemize}
		\item
			$\rdistinct{(rs @ rs')}{\varnothing} = \rdistinct{((\rdistinct{rs}{\varnothing})@ rs')}{\varnothing}$
		\item
			$\rdistinct{(rs @ rs')}{\varnothing} = \rdistinct{(\rdistinct{rs}{\varnothing} @ rs')}{\varnothing}$
		\item
			If $rset' \subseteq rset$, then $\rdistinct{rs}{rset} = 
			\rdistinct{(\rdistinct{rs}{rset'})}{rset}$. As a corollary
			of this,
		\item
			$\rdistinct{(rs @ rs')}{rset} = \rdistinct{
			(\rdistinct{rs}{\varnothing}) @ rs')}{rset}$. This
			gives another corollary use later:
		\item
			If $a \in rset$, then $\rdistinct{(rs @ rs')}{rset} = \rdistinct{
			(\rdistinct{(a :: rs)}{\varnothing} @ rs')}{rset} $,

	\end{itemize}
\end{lemma}
\begin{proof}
	By \ref{rdistinctDoesTheJob} and \ref{distinctRemovesMiddle}.
\end{proof}
\noindent
The next lemma is a more general form of \ref{rdistinctConcat},
it says that
$\textit{rdistinct}$ is composable w.r.t list concatenation:
\begin{lemma}\label{distinctRdistinctAppend}
			If $\;\; \textit{isDistinct} \; rs_1$, 
			and $(set \; rs_1) \cap acc = \varnothing$,
			then applying $\textit{rdistinct}$ on $rs_1 @ rs_a$ does not 
			have an effect on $rs_1$:
			\[\textit{rdistinct}\;  (rs_1 @ rsa)\;\, acc
			= rs_1@(\textit{rdistinct} rsa \; (acc \cup rs_1))\]
\end{lemma}
\begin{proof}
	By an induction on 
	$rs_1$, where $rsa$ and $acc$ are allowed to be arbitrary.
\end{proof}
\noindent
$\textit{rdistinct}$ needs to be applied only once, and 
applying it multiple times does not make any difference:
\begin{corollary}\label{distinctOnceEnough}
	$\textit{rdistinct} \; (rs @ rsa) {} = \textit{rdistinct} \; (rdistinct \; 
	rs \{ \} @ (\textit{rdistinct} \; rs_a \; (set \; rs)))$
\end{corollary}
\begin{proof}
	By lemma \ref{distinctRdistinctAppend}.
\end{proof}

\subsubsection{The Properties of $\textit{Rflts}$} 
We give in this subsection some properties
involving $\backslash_r$, $\backslash_{rsimps}$, $\textit{rflts}$ and 
$\textit{rsimp}_{ALTS} $, together with any non-trivial lemmas that lead to them.
These will be helpful in later closed form proofs, when
we want to transform derivative terms which have
%the ways in which multiple functions involving
%those are composed together
interleaving derivatives and  simplifications applied to them.

\noindent
%When the function $\textit{Rflts}$ 
%is applied to the concatenation of two lists, the output can be calculated by first applying the
%functions on two lists separately, and then concatenating them together.
$\textit{Rflts}$ is composable in terms of concatenation:
\begin{lemma}\label{rfltsProps}
	The function $\rflts$ has the properties below:\\
	\begin{itemize}
		\item
			$\rflts \; (rs_1 @ rs_2) = \rflts \; rs_1 @ \rflts \; rs_2$
		\item
			If $r \neq \RZERO$ and $\nexists rs_1. r = \RALTS{rs}_1$, then $\rflts \; (r::rs) = r :: \rflts \; rs$
		\item
			$\rflts \; (rs @ [\RZERO]) = \rflts \; rs$
		\item
			$\rflts \; (rs' @ [\RALTS{rs}]) = \rflts \; rs'@rs$
		\item
			$\rflts \; (rs @ [\RONE]) = \rflts \; rs @ [\RONE]$
		\item
			If $r \neq \RZERO$ and $\nexists rs'. r = \RALTS{rs'}$ then $\rflts \; (rs @ [r])
			= (\rflts \; rs) @ [r]$
		\item
			If $r = \RALTS{rs}$ and $r \in rs'$ then for all $r_1 \in rs. 
			r_1 \in \rflts \; rs'$.
		\item
			$\rflts \; (rs_a @ \RZERO :: rs_b) = \rflts \; (rs_a @ rs_b)$
	\end{itemize}
\end{lemma}
\noindent
\begin{proof}
	By induction on $rs_1$ in the first sub-lemma, and induction on $r$ in the second part,
	and induction on $rs$, $rs'$, $rs$, $rs'$, $rs_a$ in the third, fourth, fifth, sixth and 
	last sub-lemma.
\end{proof}
\noindent
Now we introduce the property that the operations 
derivative and $\rsimpalts$
commute, this will be used later on, when deriving the closed form for
the alternative regular expression:
\begin{lemma}\label{rderRsimpAltsCommute}
	$\rder{x}{(\rsimpalts \; rs)} = \rsimpalts \; (\map \; (\rder{x}{\_}) \; rs)$
\end{lemma}
\begin{proof}
	By induction on $rs$.
\end{proof}
\noindent

\subsubsection{The $RL$ Function: Language Interpretation of $\textit{Rrexp}$s}
Much like the definition of $L$ on plain regular expressions, one can also 
define the language interpretation of $\rrexp$s.
\begin{center}
	\begin{tabular}{lcl}
		$RL \; (\ZERO)$ & $\dn$ & $\phi$\\
		$RL \; (\ONE)$ & $\dn$ & $\{[]\}$\\
		$RL \; (c)$ & $\dn$ & $\{[c]\}$\\
		$RL \; \sum rs$ & $\dn$ & $ \bigcup_{r \in rs} (RL \; r)$\\
		$RL \; (r_1 \cdot r_2)$ & $\dn$ & $ RL \; (r_1) @ RL \; (r_2)$\\
		$RL \; (r^*)$ & $\dn$ & $ (RL(r))^*$
	\end{tabular}
\end{center}
\noindent
The main use of $RL$ is to establish some connections between $\rsimp{}$ 
and $\rnullable{}$:
\begin{lemma}
	The following properties hold:
	\begin{itemize}
		\item
			If $\rnullable{r}$, then $\rsimp{r} \neq \RZERO$.
		\item
			$\rnullable{r \backslash s} \quad $ if and only if $\quad \rnullable{\rderssimp{r}{s}}$.
	\end{itemize}
\end{lemma}
\begin{proof}
	The first part is by induction on $r$. 
	The second part is true because property 
	\[ RL \; r = RL \; (\rsimp{r})\] holds.
\end{proof}

\subsubsection{Simplified $\textit{Rrexp}$s are Good}
We formalise the notion of ``good" regular expressions,
which means regular expressions that
are fully simplified in terms of our $\textit{rsimp}$ function. 
For alternative regular expressions that means they
do not contain any nested alternatives, un-eliminated $\RZERO$s
or duplicate elements (for example, 
$r_1 + (r_2 + r_3)$, $\RZERO + r$ and $ \sum [r, r, \ldots]$).
The clauses for $\good$ are:
\begin{center}
	\begin{tabular}{@{}lcl@{}}
		$\good\; \RZERO$ & $\dn$ & $\textit{false}$\\
		$\good\; \RONE$ & $\dn$ & $\textit{true}$\\
		$\good\; \RCHAR{c}$ & $\dn$ & $\btrue$\\
		$\good\; \RALTS{[]}$ & $\dn$ & $\bfalse$\\
		$\good\; \RALTS{[r]}$ & $\dn$ & $\bfalse$\\
		$\good\; \RALTS{r_1 :: r_2 :: rs}$ & $\dn$ & 
		$\textit{isDistinct} \; (r_1 :: r_2 :: rs) \;$\\
						   & & $\land \; (\forall r' \in (r_1 :: r_2 :: rs).\; \good \; r'\; \,  \land \; \, \textit{nonAlt}\; r')$\\
		$\good \; \RSEQ{\RZERO}{r}$ & $\dn$ & $\bfalse$\\
		$\good \; \RSEQ{\RONE}{r}$ & $\dn$ & $\bfalse$\\
		$\good \; \RSEQ{r}{\RZERO}$ & $\dn$ & $\bfalse$\\
		$\good \; \RSEQ{r_1}{r_2}$ & $\dn$ & $\good \; r_1 \;\, \textit{and} \;\, \good \; r_2$\\
		$\good \; \RSTAR{r}$ & $\dn$ & $\btrue$\\
	\end{tabular}
\end{center}
\noindent
We omit the recursive definition of the predicate $\textit{nonAlt}$,
which evaluates to true when the regular expression is not an
alternative, and false otherwise.
The $\good$ property is preserved under $\rsimp_{ALTS}$, provided that
its non-empty argument list of expressions are all good themsleves, and $\textit{nonAlt}$, 
and unique:
\begin{lemma}\label{rsimpaltsGood}
	If $rs \neq []$ and forall $r \in rs. \textit{nonAlt} \; r$ and $\textit{isDistinct} \; rs$,
	then $\good \; (\rsimpalts \; rs)$ if and only if forall $r \in rs. \; \good \; r$.
\end{lemma}
\noindent
We also note that
if a regular expression $r$ is good, then $\rflts$ on the singleton
list $[r]$ will not break goodness:
\begin{lemma}\label{flts2}
	If $\good \; r$, then forall $r' \in \rflts \; [r]. \; \good \; r'$ and $\textit{nonAlt} \; r'$.
\end{lemma}
\begin{proof}
	By an induction on $r$.
\end{proof}
\noindent
The other observation we make about $\rsimp{r}$ is that it never
comes with nested alternatives, which we describe as the $\nonnested$
property:
\begin{center}
	\begin{tabular}{lcl}
		$\nonnested \; \, \sum []$ & $\dn$ & $\btrue$\\
		$\nonnested \; \, \sum ((\sum rs_1) :: rs_2)$ & $\dn$ & $\bfalse$\\
		$\nonnested \; \, \sum (r :: rs)$ & $\dn$ & $\nonnested (\sum rs)$\\
		$\nonnested \; \, r $ & $\dn$ & $\btrue$
	\end{tabular}
\end{center}
\noindent
The $\rflts$ function
always opens up nested alternatives,
which enables $\rsimp$ to be non-nested:

\begin{lemma}\label{nonnestedRsimp}
	It is always the case that
	\begin{center}
		$\nonnested \; (\rsimp{r})$
	\end{center}
\end{lemma}
\begin{proof}
	By induction on $r$.
\end{proof}
\noindent
With this we could prove that a regular expressions
after simplification and flattening and de-duplication,
will not contain any alternative regular expression directly:
\begin{lemma}\label{nonaltFltsRd}
	If $x \in \rdistinct{\rflts\; (\map \; \rsimp{} \; rs)}{\varnothing}$ 
	then $\textit{nonAlt} \; x$.
\end{lemma}
\begin{proof}
	By \ref{nonnestedRsimp}.
\end{proof}
\noindent
The other fact we know is that once $\rsimp{}$ has finished
processing an alternative regular expression, it will not
contain any $\RZERO$s. This is because all the recursive 
calls to the simplification on the children regular expressions
make the children good, and $\rflts$ will not delete
any $\RZERO$s out of a good regular expression list,
and $\rdistinct{}$ will not ``mess'' with the result.
\begin{lemma}\label{flts3Obv}
	The following are true:
	\begin{itemize}
		\item
			If for all $r \in rs. \, \good \; r $ or $r = \RZERO$,
			then for all $r \in \rflts\; rs. \, \good \; r$.
		\item
			If $x \in \rdistinct{\rflts\; (\map \; rsimp{}\; rs)}{\varnothing}$
			and for all $y$ such that $\llbracket y \rrbracket_r$ less than
			$\llbracket rs \rrbracket_r + 1$, either
			$\good \; (\rsimp{y})$ or $\rsimp{y} = \RZERO$,
			then $\good \; x$.
	\end{itemize}
\end{lemma}
\begin{proof}
	The first part is by induction on $rs$, where the induction
	rule is the inductive cases for $\rflts$.
	The second part is a corollary from the first part.
\end{proof}

And this leads to good structural property of $\rsimp{}$,
that after simplification, a regular expression is
either good or $\RZERO$:
\begin{lemma}\label{good1}
	For any r-regular expression $r$, $\good \; \rsimp{r}$ or $\rsimp{r} = \RZERO$.
\end{lemma}
\begin{proof}
	By an induction on $r$. The inductive measure is the size $\llbracket \rrbracket_r$.
	Lemma \ref{rsimpMono} says that 
	$\llbracket \rsimp{r}\rrbracket_r$ is smaller than or equal to
	$\llbracket r \rrbracket_r$.
	Therefore, in the $r_1 \cdot r_2$ and $\sum rs$ case,
	Inductive hypothesis applies to the children regular expressions
	$r_1$, $r_2$, etc. The lemma \ref{flts3Obv}'s precondition is satisfied
	by that as well.
	The lemmas \ref{nonnestedRsimp} and  \ref{nonaltFltsRd} are used
	to ensure that goodness is preserved at the topmost level.
\end{proof}
We shall prove that any good regular expression is 
a fixed-point for $\rsimp{}$.
First we prove an auxiliary lemma:
\begin{lemma}\label{goodaltsNonalt}
	If $\good \; \sum rs$, then $\rflts\; rs = rs$.
\end{lemma}
\begin{proof}
	By an induction on $\sum rs$. The inductive rules are the cases
	for $\good$.
\end{proof}
\noindent
Now we are ready to prove that good regular expressions are invariant
with respect to $\rsimp{}$:
\begin{lemma}\label{test}
	If $\good \;r$ then $\rsimp{r} = r$.
\end{lemma}
\begin{proof}
	By an induction on the inductive cases of $\good$, using lemmas
	\ref{goodaltsNonalt} and \ref{rdistinctOnDistinct}.
	The lemma \ref{goodaltsNonalt} is used in the alternative
	case where 2 or more elements are present in the list.
\end{proof}
\noindent
Given below is a property involving $\rflts$, $\textit{rdistinct}$, 
$\rsimp{}$ and $\rsimp_{ALTS}$,
which requires $\ref{good1}$ to go through smoothly:
\begin{lemma}\label{flattenRsimpalts}
An application of $\rsimp_{ALTS}$ can be ``absorbed'',
if its output is concatenated with a list and then applied to $\rflts$.
\begin{center}
	$\rflts \; ( (\rsimp_{ALTS} \; 
	(\rdistinct{(\rflts \; (\map \; \rsimp{}\; rs))}{\varnothing})) :: 
	\map \; \rsimp{} \; rs' ) = 
	\rflts \; ( (\rdistinct{(\rflts \; (\map \; \rsimp{}\; rs))}{\varnothing}) @ (
	\map \; \rsimp{rs'}))$
\end{center}


\end{lemma}
\begin{proof}
	By \ref{good1}.
\end{proof}
\noindent





We are also ready to prove that $\textit{rsimp}$ is idempotent.
\subsubsection{$\rsimp$ is Idempotent}
The idempotency of $\rsimp$ is very useful in 
manipulating regular expression terms into desired
forms so that key steps allowing further rewriting to closed forms
are possible.
\begin{lemma}\label{rsimpIdem}
	$\rsimp{r} = \rsimp{(\rsimp{r})}$
\end{lemma}

\begin{proof}
	By \ref{test} and \ref{good1}.
\end{proof}
\noindent
This property means we do not have to repeatedly
apply simplification in each step, which justifies
our definition of $\blexersimp$.


On the other hand, we can repeat the same $\rsimp{}$ applications
on regular expressions as many times as we want, if we have at least
one simplification applied to it, and apply it wherever we would like to:
\begin{corollary}\label{headOneMoreSimp}
	The following properties hold, directly from \ref{rsimpIdem}:

	\begin{itemize}
		\item
			$\map \; \rsimp{(r :: rs)} = \map \; \rsimp{} \; (\rsimp{r} :: rs)$
		\item
			$\rsimp{(\RALTS{rs})} = \rsimp{(\RALTS{\map \; \rsimp{} \; rs})}$
	\end{itemize}
\end{corollary}
\noindent
This will be useful in the later closed form proof's rewriting steps.
Similarly, we state the following useful facts below:
\begin{lemma}
	The following equalities hold if $r = \rsimp{r'}$ for some $r'$:
	\begin{itemize}
		\item
			If $r  = \sum rs$ then $\rsimpalts \; rs = \sum rs$.
		\item
			If $r = \sum rs$ then $\rdistinct{rs}{\varnothing} = rs$.
		\item
			$\rsimpalts \; (\rdistinct{\rflts \; [r]}{\varnothing}) = r$.
	\end{itemize}
\end{lemma}
\begin{proof}
	By application of lemmas \ref{rsimpIdem} and \ref{good1}.
\end{proof}

\noindent
With the idempotency of $\rsimp{}$ and its corollaries, 
we can start proving some key equalities leading to the 
closed forms.
Now presented are a few equivalent terms under $\rsimp{}$.
To make the notation more concise
We use $r_1 \sequal r_2 $ to denote that $\rsimp{r_1} = \rsimp{r_2}$:
\begin{center}
\begin{tabular}{lcl}
	$a \sequal b$ & $ \dn$ & $ \textit{rsimp} \; a = \textit{rsimp} \; b$
\end{tabular}
\end{center}
\noindent
%\vspace{0em}
\begin{lemma}
	The following equivalence hold:
	\begin{itemize}
	\item
		$\rsimpalts \; (\RZERO :: rs) \sequal \rsimpalts\; rs$
	\item
		$\rsimpalts \; rs \sequal \rsimpalts (\map \; \rsimp{} \; rs)$
	\item
		$\RALTS{\RALTS{rs}} \sequal \RALTS{rs}$
	\item
		$\sum ((\sum rs_a) :: rs_b) \sequal \sum rs_a @ rs_b$
	\item
		$\RALTS{rs} = \RALTS{\map \; \rsimp{} \; rs}$
\end{itemize}
\end{lemma}
\begin{proof}
	By induction on the lists involved.
\end{proof}
\noindent
The above allows us to prove
two similar equalities (which are a bit more involved).
It says that we could flatten the elements
before simplification and still get the same result.
\begin{lemma}\label{simpFlatten3}
	One can flatten the inside $\sum$ of a $\sum$ if it is being 
	simplified. Concretely,
	\begin{itemize}
		\item
			If for all $r \in rs, rs', rs''$, we have $\good \; r $
			or $r = \RZERO$, then $\sum (rs' @ rs @ rs'') \sequal 
			\sum (rs' @ [\sum rs] @ rs'')$ holds. As a corollary,
		\item
			$\sum (rs' @ [\sum rs] @ rs'') \sequal \sum (rs' @ rs @ rs'')$
	\end{itemize}
\end{lemma}
\begin{proof}
	By rewriting steps involving the use of \ref{test} and \ref{rdistinctConcatGeneral}.
	The second sub-lemma is a corollary of the previous.
\end{proof}
%Rewriting steps not put in--too long and complicated-------------------------------
\begin{comment}
	\begin{center}
		$\rsimp{\sum (rs' @ rs @ rs'')}  \stackrel{def of bsimp}{=}$  \\
		$\rsimpalts \; (\rdistinct{\rflts \; ((\map \; \rsimp{}\; rs') @ (\map \; \rsimp{} \; rs ) @ (\map \; \rsimp{} \; rs''))}{\varnothing})$ \\
		$\stackrel{by \ref{test}}{=} 
		\rsimpalts \; (\rdistinct{(\rflts \; rs' @ \rflts \; rs @ \rflts \; rs'')}{
		\varnothing})$\\
		$\stackrel{by \ref{rdistinctConcatGeneral}}{=}
		\rsimpalts \; (\rdistinct{\rflts \; rs'}{\varnothing} @ \rdistinct{(
		\rflts\; rs @ \rflts \; rs'')}{\rflts \; rs'})$\\

	\end{center}
\end{comment}
%Rewriting steps not put in--too long and complicated-------------------------------
\noindent


We need more equalities like the above to enable a closed form lemma,
for which we need to introduce a few rewrite relations
to help
us obtain them.

\subsection{The rewrite relation $\hrewrite$ , $\scfrewrites$ , $\frewrite$ and $\grewrite$}
Inspired by the success we had in the correctness proof 
in \ref{Bitcoded2},
we follow suit here, defining atomic simplification
steps as ``small-step'' rewriting steps. This allows capturing 
similarities between terms that would be otherwise
hard to express.

We use $\hrewrite$ for one-step atomic rewrite of 
regular expression simplification, 
$\frewrite$ for rewrite of list of regular expressions that 
include all operations carried out in $\rflts$, and $\grewrite$ for
rewriting a list of regular expressions possible in both $\rflts$ and $\textit{rdistinct}$.
Their reflexive transitive closures are used to denote zero or many steps,
as was the case in the previous chapter.
As we have already
done something similar, the presentation about
these rewriting rules will be more concise than that in \ref{Bitcoded2}.
To differentiate between the rewriting steps for annotated regular expressions
and $\rrexp$s, we add characters $h$ and $g$ below the squig arrow symbol
to mean atomic simplification transitions 
of $\rrexp$s and $\rrexp$ lists, respectively.




\begin{figure}[H]
\begin{center}
	\begin{mathpar}
		\inferrule[RSEQ0L]{}{\RZERO \cdot r_2 \hrewrite \RZERO\\}

		\inferrule[RSEQ0R]{}{r_1 \cdot \RZERO \hrewrite \RZERO\\}

		\inferrule[RSEQ1]{}{(\RONE \cdot r) \hrewrite  r\\}\\	

		\inferrule[RSEQL]{ r_1 \hrewrite r_2}{r_1 \cdot r_3 \hrewrite r_2 \cdot r_3\\}

		\inferrule[RSEQR]{ r_3 \hrewrite r_4}{r_1 \cdot r_3 \hrewrite r_1 \cdot r_4\\}\\

		\inferrule[RALTSChild]{r \hrewrite r'}{\sum (rs_1 @ [r] @ rs_2) \hrewrite \sum (rs_1 @ [r'] @ rs_2)\\}

		\inferrule[RALTS0]{}{\sum (rs_a @ [\RZERO] @ rs_b) \hrewrite \sum (rs_a @ rs_b)}

		\inferrule[RALTSNested]{}{\sum (rs_a @ [\sum rs_1] @ rs_b) \hrewrite \sum (rs_a @ rs_1 @ rs_b)}

		\inferrule[RALTSNil]{}{ \sum [] \hrewrite \RZERO\\}

		\inferrule[RALTSSingle]{}{ \sum [r] \hrewrite  r\\}	

		\inferrule[RALTSDelete]{\\ r_1 = r_2}{\sum rs_a @ [r_1] @ rs_b @ [r_2] @ rsc \hrewrite \sum rs_a @ [r_1] @ rs_b @ rs_c}

	\end{mathpar}
\end{center}
\caption{List of one-step rewrite rules for r-regular expressions ($\hrewrite$)}\label{hRewrite}
\end{figure}


Like $\rightsquigarrow_s$, it is
convenient to define rewrite rules for a list of regular expressions,
where each element can rewrite in many steps to the other (scf stands for
li\emph{s}t \emph{c}losed \emph{f}orm). This relation is similar to the 
$\stackrel{s*}{\rightsquigarrow}$ for annotated regular expressions.

\begin{figure}[H]
\begin{center}
	\begin{mathpar}
		\inferrule{}{[] \scfrewrites [] }

		\inferrule{r \hrewrites r' \\ rs \scfrewrites rs'}{r :: rs \scfrewrites r' :: rs'}
	\end{mathpar}
\end{center}
\caption{List of one-step rewrite rules for a list of r-regular expressions}\label{scfRewrite}
\end{figure}
%frewrite
List of one-step rewrite rules for flattening 
a list of  regular expressions($\frewrite$):
\begin{figure}[H]
\begin{center}
	\begin{mathpar}
		\inferrule{}{\RZERO :: rs \frewrite rs \\}

		\inferrule{}{(\sum rs) :: rs_a \frewrite rs @ rs_a \\}

		\inferrule{rs_1 \frewrite rs_2}{r :: rs_1 \frewrite r :: rs_2}
	\end{mathpar}
\end{center}
\caption{List of one-step rewrite rules characterising the $\rflts$ operation on a list}\label{fRewrites}
\end{figure}

Lists of one-step rewrite rules for flattening and de-duplicating
a list of regular expressions ($\grewrite$):
\begin{figure}[H]
\begin{center}
	\begin{mathpar}
		\inferrule{}{\RZERO :: rs \grewrite rs \\}

		\inferrule{}{(\sum rs) :: rs_a \grewrite rs @ rs_a \\}

		\inferrule{rs_1 \grewrite rs_2}{r :: rs_1 \grewrite r :: rs_2}

		\inferrule[dB]{}{rs_a @ [a] @ rs_b @[a] @ rs_c \grewrite rs_a @ [a] @ rsb @ rsc}
	\end{mathpar}
\end{center}
\caption{List of one-step rewrite rules characterising the $\rflts$ and $\textit{rdistinct}$
operations}\label{gRewrite}
\end{figure}
\noindent
We define
two separate list rewriting relations $\frewrite$ and $\grewrite$.
The rewriting steps that take place during
flattening is characterised by $\frewrite$.
The rewrite relation $\grewrite$ characterises both flattening and de-duplicating.
Sometimes $\grewrites$ is slightly too powerful
so we would rather use $\frewrites$ to prove
%because we only
equalities related to $\rflts$.
%certain equivalence under the rewriting steps of $\frewrites$.
For example, when proving the closed-form for the alternative regular expression,
one of the equalities needed is:
\begin{center}
	$\sum (\rDistinct \;\; (\map \; (\_ \backslash x) \; (\rflts \; rs)) \;\; \varnothing) \sequal
	\sum (\rDistinct \;\;  (\rflts \; (\map \; (\_ \backslash x) \; rs)) \;\; \varnothing)
	$
\end{center}
\noindent
Proving this is by first showing 
\begin{lemma}\label{earlyLaterDerFrewrites}
	$\map \; (\_ \backslash x) \;  (\rflts \; rs) \frewrites
	\rflts \; (\map \; (\_ \backslash x) \; rs)$
\end{lemma}
\noindent
and then the equivalence between two terms
that can reduce in many steps to each other:
\begin{lemma}\label{frewritesSimpeq}
	If $rs_1 \frewrites rs_2 $, then $\sum (\rDistinct \; rs_1 \; \varnothing) \sequal 
	\sum (\rDistinct \;  rs_2 \;  \varnothing)$.
\end{lemma}
\noindent
These two lemmas can both be proven using a straightforward induction (and
the proofs for them are therefore omitted).

Now the above equalities can be derived like a breeze:
\begin{corollary}
	$\sum (\rDistinct \;\; (\map \; (\_ \backslash x) \; (\rflts \; rs)) \;\; \varnothing) \sequal
	\sum (\rDistinct \;\;  (\rflts \; (\map \; (\_ \backslash x) \; rs)) \;\; \varnothing)
	$
\end{corollary}
\begin{proof}
	By lemmas \ref{earlyLaterDerFrewrites} and \ref{frewritesSimpeq}.
\end{proof}
But this trick will not work for $\grewrites$.
For example, a rewriting step in proving
closed forms is:
\begin{center}
	$\rsimp{(\rsimpalts \; (\map \; (\_ \backslash x) \; (\rdistinct{(\rflts \; (\map \; (\rsimp{} \; \circ \; (\lambda r. \rderssimp{r}{xs}))))}{\varnothing})))}$\\
	$=$ \\
	$\rsimp{(\rsimpalts \; (\rdistinct{(\map \; (\_ \backslash x) \; (\rflts \; (\map \; (\rsimp{} \; \circ \; (\lambda r. \rderssimp{r}{xs})))) ) }{\varnothing}))} $
	\noindent
\end{center}
For this, one would hope to have a rewriting relation between the two lists involved,
similar to \ref{earlyLaterDerFrewrites}. However, it turns out that 
\begin{center}
	$\map \; (\_ \backslash x) \; (\rDistinct \; rs \; rset) \grewrites \rDistinct \; (\map \;
	(\_ \backslash x) \; rs) \; ( rset \backslash x)$
\end{center}
\noindent
does $\mathbf{not}$ hold in general.
For this rewriting step we will introduce some slightly more cumbersome
proof technique later.
The point is that $\frewrite$
allows us to prove equivalence in a straightforward way that is 
not possible for $\grewrite$. 


\subsubsection{Terms That Can Be Rewritten Using $\hrewrites$, $\grewrites$, and $\frewrites$}
In this part, we present lemmas stating
pairs of r-regular expressions and r-regular expression lists
where one could rewrite from one in many steps to the other.
Most of the proofs to these lemmas are straightforward, using
an induction on the corresponding rewriting relations.
These proofs will therefore be omitted when this is the case.
We present in the following lemma a few pairs of terms that are rewritable via 
$\grewrites$:
\begin{lemma}\label{gstarRdistinctGeneral}
	\mbox{}
	\begin{itemize}
		\item
			$rs_1 @ rs \grewrites rs_1 @ (\rDistinct \; rs \; rs_1)$
		\item
			$rs \grewrites \rDistinct \; rs \; \varnothing$
		\item
			$rs_a @ (\rDistinct \; rs \; rs_a) \grewrites rs_a @ (\rDistinct \; 
			rs \; (\{\RZERO\} \cup rs_a))$
		\item
			$rs \;\; @ \;\; \rDistinct \; rs_a \; rset \grewrites rs @  \rDistinct \; rs_a \;
			(rest \cup rs)$

	\end{itemize}
\end{lemma}
\noindent
If a pair of terms $rs_1, rs_2$ are rewritable via $\grewrites$ to each other,
then they are equivalent under $\rsimp{}$:
\begin{lemma}\label{grewritesSimpalts}
	\mbox{}
	If $rs_1 \grewrites rs_2$, then
	we have the following equivalence:
	\begin{itemize}
		\item
			$\sum rs_1 \sequal \sum rs_2$
		\item
			$\rsimpalts \; rs_1 \sequal \rsimpalts \; rs_2$
	\end{itemize}
\end{lemma}
\noindent
Here are a few connecting lemmas showing that
if a list of regular expressions can be rewritten using $\grewrites$ or $\frewrites $ or
$\scfrewrites$,
then an alternative constructor taking the list can also be rewritten using $\hrewrites$:
\begin{lemma}
	\begin{itemize}
		\item
			If $rs \grewrites rs'$ then $\sum rs \hrewrites \sum rs'$.
		\item
			If $rs \grewrites rs'$ then $\sum rs \hrewrites \rsimpalts \; rs'$
		\item
			If $rs_1 \scfrewrites rs_2$ then $\sum (rs @ rs_1) \hrewrites \sum (rs @ rs_2)$
		\item
			If $rs_1 \scfrewrites rs_2$ then $\sum rs_1 \hrewrites \sum rs_2$

	\end{itemize}
\end{lemma}
\noindent
Now comes the core of the proof, 
which says that once two lists are rewritable to each other,
then they are equivalent under $\rsimp{}$:
\begin{lemma}
	If $r_1 \hrewrites r_2$ then $r_1 \sequal r_2$.
\end{lemma}

\noindent
Similar to what we did in \ref{Bitcoded2}, 
we prove that if one can rewrite from one r-regular expression ($r$)
to the other ($r'$), after taking derivatives one could still rewrite
the first ($r\backslash c$) to the other ($r'\backslash c$).
\begin{lemma}\label{interleave}
	If $r \hrewrites r' $ then $\rder{c}{r} \hrewrites \rder{c}{r'}$
\end{lemma}
\noindent
This allows us proving more $\mathbf{rsimp}$-equivalent terms, involving $\backslash_r$.
\begin{lemma}\label{insideSimpRemoval}
	$\rsimp{(\rder{c}{(\rsimp{r})})} = \rsimp{(\rder{c}{r})}  $
\end{lemma}
\noindent
\begin{proof}
	By \ref{interleave} and \ref{rsimpIdem}.
\end{proof}
\noindent
And this unlocks more equivalent terms:
\begin{lemma}\label{Simpders}
	As corollaries of \ref{insideSimpRemoval}, we have
	\begin{itemize}
		\item
			If $s \neq []$ then $\rderssimp{r}{s} = \rsimp{( r \backslash_{rs} s)}$.
		\item
			$\rsimpalts \; (\map \; (\_ \backslash_r x) \;
			(\rdistinct{rs}{\varnothing})) \sequal
			\rsimpalts \; (\rDistinct \; 
			(\map \; (\_ \backslash_r x) rs) \;\varnothing  )$
	\end{itemize}
\end{lemma}
\begin{proof}
	Part 1 is by lemma \ref{insideSimpRemoval},
	part 2 is by lemma \ref{insideSimpRemoval} .%and \ref{distinctDer}.
\end{proof}
\noindent

\subsection{Closed Forms for $\sum rs$, $r_1\cdot r_2$ and $r^*$}
Lemma \ref{Simpders} leads to our first closed form,
which is for the alternative regular expression:
\begin{theorem}\label{altsClosedForm}
	\mbox{}
	\begin{center}
		$\rderssimp{(\sum rs)}{s} \sequal
		\sum \; (\map \; (\rderssimp{\_}{s}) \; rs)$
	\end{center}
\end{theorem}
\noindent
\begin{proof}
	By a reverse induction on the string $s$.
	One rewriting step, as we mentioned earlier,
	involves
	\begin{center}
		$\rsimpalts \; (\map \; (\_ \backslash x) \; 
		(\rdistinct{(\rflts \; (\map \; (\rsimp{} \; \circ \; 
		(\lambda r. \rderssimp{r}{xs}))))}{\varnothing}))
		\sequal
		\rsimpalts \; (\rdistinct{(\map \; (\_ \backslash x) \; 
			(\rflts \; (\map \; (\rsimp{} \; \circ \; 
		(\lambda r. \rderssimp{r}{xs})))) ) }{\varnothing}) $.
	\end{center}
	This can be proven by a combination of 
	\ref{grewritesSimpalts}, \ref{gstarRdistinctGeneral}, \ref{rderRsimpAltsCommute}, and
	\ref{insideSimpRemoval}.
\end{proof}
\noindent
This closed form has a variant which can be more convenient in later proofs:
\begin{corollary}\label{altsClosedForm1}
	If $s \neq []$ then 
	$\rderssimp \; (\sum \; rs) \; s = 
	\rsimp{(\sum \; (\map \; \rderssimp{\_}{s} \; rs))}$.
\end{corollary}
\noindent
The harder closed forms are the sequence and star ones.
Before we obtain them, some preliminary definitions
are needed to make proof statements concise.


\subsubsection{Closed Form for Sequence Regular Expressions}
For the sequence regular expression,
let's first look at a series of derivative steps on it 
(assuming that each time when a derivative is taken,
the head of the sequence is always nullable):
\begin{center}
	\begin{tabular}{llll}
		$r_1 \cdot r_2$ &  
		$\longrightarrow_{\backslash c}$ &  
		$r_1\backslash c \cdot r_2 + r_2 \backslash c$ &
		$ \longrightarrow_{\backslash c'} $ \\ 
		\\
		$(r_1 \backslash cc' \cdot r_2 + r_2 \backslash c') + r_2 \backslash cc'$ & 
		$\longrightarrow_{\backslash c''} $ &
		$((r_1 \backslash cc'c'' \cdot r_2 + r_2 \backslash c'') + r_2 \backslash c'c'') 
		+ r_2 \backslash cc'c''$ & 
		$   \longrightarrow_{\backslash c''} \quad \ldots$\\
	\end{tabular}
\end{center}
Roughly speaking $r_1 \cdot r_2 \backslash s$ can be expresssed as 
a giant alternative taking a list of terms 
$[r_1 \backslash_r s \cdot r_2, r_2 \backslash_r s'', r_2 \backslash_r s_1'', \ldots]$,
where the head of the list is always the term
representing a match involving only $r_1$, and the tail of the list consisting of
terms of the shape $r_2 \backslash_r s''$, $s''$ being a suffix of $s$.
This intuition is also echoed by Murugesan and Sundaram \cite{Murugesan2014}, 
where they gave
a pencil-and-paper derivation of $(r_1 \cdot r_2)\backslash s$:
\begin{center}
	\begin{tabular}{lc}
		$L \; [ (r_1 \cdot r_2) \backslash_r (c_1 :: c_2 :: \ldots c_n) ]$ & $ =$\\
		\\
		\rule{0pt}{3ex} $L \; [ ((r_1 \backslash_r c_1) \cdot r_2 + 
		(\delta\; (\nullable \; r_1) \; (r_2 \backslash_r c_1) )) \backslash_r 
		(c_2 :: \ldots c_n) ]$ &
		$=$\\
		\\
		\rule{0pt}{3ex} $L \; [ ((r_1 \backslash_r c_1c_2 \cdot r_2 + 
		(\delta \; (\nullable \; r_1) \; (r_2 \backslash_r c_1c_2)))
		$ & \\
		\\
		$\quad + (\delta \ (\nullable \; r_1 \backslash_r c)\; (r_2 \backslash_r c_2) )) 
		\backslash_r (c_3 \ldots c_n) ]$ & $\ldots$ \\
	\end{tabular}
\end{center}
\noindent
The $\delta$ function 
returns $r$ when the boolean condition
$b$ evaluates to true and
$\ZERO$ otherwise:
\begin{center}
	\begin{tabular}{lcl}
		$\delta \; b\; r$ & $\dn$ & $r \quad \textit{if} \; b \; is \;\textit{true}$\\
				  & $\dn$ & $\ZERO \quad otherwise$
	\end{tabular}
\end{center}
\noindent
Note that the term
\begin{center}
	\begin{tabular}{lc}
		\rule{0pt}{3ex} $((r_1 \backslash_r c_1c_2 \cdot r_2 + 
		(\delta \; (\nullable \; r_1) \; (r_2 \backslash_r c_1c_2)))
		$ & \\
		\\
		$\quad + (\delta \ (\nullable \; r_1 \backslash_r c)\; (r_2 \backslash_r c_2) )) 
		\backslash_r (c_3 \ldots c_n)$ &\\
	\end{tabular}
\end{center}
\noindent
does not faithfully
represent what the intermediate derivatives would actually look like
when one or more intermediate results $r_1 \backslash s' \cdot r_2$ are not 
nullable in the head of the sequence.
For example, when $r_1$ and $r_1 \backslash_r c_1$ are not nullable,
the regular expression would not look like 
\[
	r_1 \backslash_r c_1c_2
\]
instead of
\[
	(r_1 \backslash_r c_1c_2 + \ZERO ) + \ZERO.
\]
The redundant $\ZERO$s will not be created in the
first place.
In a closed-form one needs to take into account this (because
closed forms require exact equality rather than language equivalence)
and only generate the 
$r_2 \backslash_r s''$ terms satisfying the property
\begin{center}
$\exists  s'.  such \; that \; s'@s'' = s \;\; \land \;\;
r_1 \backslash s' \; is \; nullable$.
\end{center}
Given the arguments $s$ and $r_1$, we denote the list of strings
$s''$ satisfying the above property as $\vsuf{s}{r_1}$.
The function $\vsuf{\_}{\_}$ is defined recursively on the structure of the string\footnote{
	Perhaps a better name of it would be ``NullablePrefixSuffix''
	to differentiate with the list of \emph{all} prefixes of $s$, but
	that is a bit too long for a function name and we are yet to find
a more concise and easy-to-understand name.}
\begin{center}
	\begin{tabular}{lcl}
		$\vsuf{[]}{\_} $ & $=$ &  $[]$\\
		$\vsuf{c::cs}{r_1}$ & $ =$ & $ \textit{if} \; (\rnullable{r_1}) \; \textit{then} \; (\vsuf{cs}{(\rder{c}{r_1})}) @ [c :: cs]$\\
				    && $\textit{else} \; (\vsuf{cs}{(\rder{c}{r_1}) })  $
	\end{tabular}
\end{center}
\noindent
The list starts with shorter suffixes
and ends with longer ones (in other words, the string elements $s''$
in the list $\vsuf{s}{r_1}$ are sorted
in the same order as that of the terms $r_2\backslash s''$ 
appearing in $(r_1\cdot r_2)\backslash s$).
In essence, $\vsuf{\_}{\_}$ is doing a 
"virtual derivative" of $r_1 \cdot r_2$, but instead of producing 
the entire result $(r_1 \cdot r_2) \backslash s$, 
it only stores strings,
with each string $s''$ representing a term such that $r_2 \backslash s''$
is occurring in $(r_1\cdot r_2)\backslash s$.

With $\textit{Suffix}$ we are ready to express the
sequence regular expression's closed form,
but before doing so 
more devices are needed.
The first thing is the flattening function $\sflat{\_}$,
which takes an alternative regular expression and produces a flattened version
of that alternative regular expression.
It is needed to convert
a left-associative nested sequence of alternatives into 
a flattened list:
\[
	\sum(\ldots ((r_1 + r_2) + r_3) + \ldots)
	\stackrel{\sflat{\_}}{\rightarrow} 
	\sum[r_1, r_2, r_3, \ldots]
\]
\noindent
The definitions of $\sflat{\_}$ and helper functions
$\sflataux{\_}$ and $\llparenthesis \_ \rrparenthesis''$ are given below.
\begin{center}  
	\begin{tabular}{lcl}
		$\sflataux{\sum r :: rs}$ & $\dn$ & $\sflataux{r} @ rs$\\
		$\sflataux{\sum []}$ & $ \dn $ & $ []$\\
		$\sflataux r$ & $\dn$ & $ [r]$
	\end{tabular}
\end{center}

\begin{center} 
	\begin{tabular}{lcl}
		$\sflat{(\sum r :: rs)}$ & $\dn$ & $\sum (\sflataux{r} @ rs)$\\
		$\sflat{\sum []}$ & $ \dn $ & $ \sum []$\\
		$\sflat r$ & $\dn$ & $ r$
	\end{tabular}
\end{center}

\begin{center}  
	\begin{tabular}{lcl}
		$\sflataux{[]}'$ & $ \dn $ & $ []$\\
		$\sflataux{ (r_1 + r_2) :: rs }'$ & $\dn$ & $r_1 :: r_2 :: rs$\\
		$\sflataux{r :: rs}$ & $\dn$ & $ r::rs$
	\end{tabular}
\end{center}
\noindent
$\sflataux{\_}$ breaks up nested alternative regular expressions 
of the $(\ldots((r_1 + r_2) + r_3) + \ldots )$(left-associated) shape
into a "balanced" list: $\AALTS{\_}{[r_1,\, r_2 ,\, r_3, \ldots]}$.
It will return the singleton list $[r]$ otherwise.
$\sflat{\_}$ works the same  as $\sflataux{\_}$, except that it keeps
the output type a regular expression, not a list.
$\sflataux{\_}$  and $\sflat{\_}$ are only recursive on the  
first element of the list.
$\sflataux{\_}'$ takes a list of regular expressions as input, and outputs
a list of regular expressions.
The use of $\sflataux{\_}$ and $\sflataux{\_}'$ is clear once we have
$\textit{createdBySequence}$ defined:
\begin{center}
	\begin{mathpar}
		\inferrule{\mbox{}}{\textit{createdBySequence}\; (r_1 \cdot r_2)}

		\inferrule{\textit{createdBySequence} \; r_1}{\textit{createdBySequence} \;
		(r_1 + r_2)}
	\end{mathpar}
\end{center}
\noindent
The predicate $\textit{createdBySequence}$ is used to describe the shape of
the derivative regular expressions $(r_1\cdot r_2) \backslash s$:
\begin{lemma}\label{recursivelyDerseq}
	It is always the case that
	\begin{center}
		$\textit{createdBySequence} \; ( (r_1\cdot r_2) \backslash_r s) $
	\end{center}
	holds.
\end{lemma}
\begin{proof}
	By a reverse induction on the string $s$, where the inductive cases are $[]$
	and $xs  @ [x]$.
\end{proof}
\noindent
If we have a regular expression $r$ whose shpae 
fits into those described by $\textit{createdBySequence}$,
then we can convert between 
$r \backslash_r c$ and $(\sflataux{r}) \backslash_r c$ with
$\sflataux{\_}'$:
\begin{lemma}\label{sfauIdemDer}
	If $\textit{createdBySequence} \; r$, then 
	\begin{center}
		$\sflataux{ r \backslash_r c} = 
		\llparenthesis (\map \; (\_ \backslash_r c) \; (\sflataux{r}) ) \rrparenthesis''$
	\end{center}
	holds.
\end{lemma}
\begin{proof}
	By a simple induction on the inductive cases of $\textit{createdBySequence}.
	$
\end{proof}

Now we are ready to express
the shape of $r_1 \cdot r_2 \backslash s$
\begin{lemma}\label{seqSfau0}
	$\sflataux{(r_1 \cdot r_2) \backslash_r s} = (r_1 \backslash_r s) \cdot r_2 
	:: (\map \; (r_2 \backslash_r \_) \; (\textit{Suffix} \; s \; r1))$ 
\end{lemma}
\begin{proof}
	By a reverse induction on the string $s$, where the inductive cases 
	are $[]$ and $xs @ [x]$.
	For the inductive case, we know that $\textit{createdBySequence} \; ((r_1 \cdot r_2)
	\backslash_r xs)$ holds from lemma \ref{recursivelyDerseq},
	which can be used to prove
	\[
		\map \; (r_2 \backslash_r \_) \; (\vsuf{[x]}{(r_1 \backslash_r xs)}) \;\; @ \;\;
		\map \; (\_ \backslash_r x) \; (\map \; (r_2 \backslash \_) \; (\vsuf{xs}{r_1}))
	\]
	=
	\[
		\map \; (r_2 \backslash_r \_) \; (\vsuf{xs @ [x]}{r_1})
	\]
	using lemma \ref{sfauIdemDer}.
	This equality enables the inductive case to go through.
\end{proof}
\noindent 
This lemma says that $(r_1\cdot r_2)\backslash s$
can be flattened into a list whose head and tail meet the description
we gave earlier.
%Note that this lemma does $\mathbf{not}$ depend on any
%specific definitions we used,
%allowing people investigating derivatives to get an alternative
%view of what $r_1 \cdot r_2$ is.

We now use $\textit{createdBySequence}$ and
$\sflataux{\_}$ to describe an intuition
behind the closed form of the sequence closed form.
If two regular expressions only differ in the way their
alternatives are nested, then we should be able to get the same result
once we apply simplification to both of them:
\begin{lemma}\label{sflatRsimpeq}
	If $r$ is created from a sequence through
	a series of derivatives 
	(i.e. if $\textit{createdBySequence} \; r$ holds), 
	and that $\sflataux{r} = rs$,
	then we have
	that 
	\begin{center}
		$\textit{rsimp} \; r = \textit{rsimp} \; (\sum \; rs)$
	\end{center}
	holds.
\end{lemma}
\begin{proof}
	By an induction on the inductive cases of $\textit{createdBySequence}$. 
\end{proof}

Now we are ready for the closed form 
for the sequence regular expressions (without the inner applications
of simplifications):
\begin{lemma}\label{seqClosedFormGeneral}
	$\rsimp{\sflat{(r_1 \cdot r_2) \backslash s} }
	=\rsimp{(\sum (  (r_1 \backslash s) \cdot r_2 :: 
	\map\; (r_2 \backslash \_) \; (\vsuf{s}{r_1})))}$
\end{lemma}
\begin{proof}
	We know that 
	$\sflataux{(r_1 \cdot r_2) \backslash_r s} = (r_1 \backslash_r s) \cdot r_2 
	:: (\map \; (r_2 \backslash_r \_) \; (\textit{Suffix} \; s \; r1))$
	holds
	by lemma \ref{seqSfau0}.
	This allows the theorem to go through because of lemma \ref{sflatRsimpeq}.
\end{proof}
Together with the idempotency property of $\rsimp{}$ (lemma \ref{rsimpIdem}),
it is possible to convert the above lemma to obtain the
proper closed form for $\backslash_{rsimps}$ rather than $\backslash_r$:
for  derivatives nested with simplification:
\begin{theorem}\label{seqClosedForm}
	$\rderssimp{(r_1 \cdot r_2)}{s} = \rsimp{(\sum ((r_1 \backslash s) \cdot r_2 ) 
	:: (\map \; (r_2 \backslash \_) (\vsuf{s}{r_1})))}$
\end{theorem}
\begin{proof}
	By a case analysis of the string $s$.
	When $s$ is an empty list, the rewrite is straightforward.
	When $s$ is a non-empty list, the
	lemmas \ref{seqClosedFormGeneral} and \ref{Simpders} apply,
	making the proof go through.
\end{proof}
\subsubsection{Closed Forms for Star Regular Expressions}
The closed form for the star regular expression involves similar tricks
for the sequence regular expression.
The $\textit{Suffix}$ function is now replaced by something
slightly more complex, because the growth pattern of star
regular expressions' derivatives is a bit different:
\begin{center}
	\begin{tabular}{lclc}
		$r^* $ & $\longrightarrow_{\backslash c}$ & 
		$(r\backslash c)  \cdot  r^*$ & $\longrightarrow_{\backslash c'}$\\
		\\
		$r \backslash cc'  \cdot r^* + r \backslash c' \cdot r^*$ &
		$\longrightarrow_{\backslash c''}$ & 
		$(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + 
		(r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*)$ & 
		$\longrightarrow_{\backslash c'''}$ \\
		\\
		$\ldots$\\
	\end{tabular}
\end{center}
When we have a string $s = c :: c' :: c'' \ldots$  such that $r \backslash c$, $r \backslash cc'$, $r \backslash c'$, 
$r \backslash cc'c''$, $r \backslash c'c''$, $r\backslash c''$ etc. are all nullable,
the number of terms in $r^* \backslash s$ will grow exponentially rather than linearly
in the sequence case.
The good news is that the function $\textit{rsimp}$ will again
ignore the difference between differently nesting patterns of alternatives,
and the exponentially growing star derivative like
\begin{center}
	$(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + 
	(r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) $ 
\end{center}
can be treated as
\begin{center}
	$\RALTS{[r_1 \backslash cc'c'' \cdot r^*, r \backslash c'', 
	r \backslash c'c'' \cdot r^*, r \backslash c'' \cdot r^*]}$
\end{center}
which can be de-duplicated by $\rDistinct$
and therefore bounded finitely.

and then de-duplicate terms of the form  ($s'$ being a substring of $s$).
This allows us to use a similar technique as $r_1 \cdot r_2$ case,

Now the crux of this section is finding a suitable description
for $rs$ where
\begin{center}
	$\rderssimp{r^*}{s} = \rsimp{\sum rs}$.
\end{center}
holds.
In addition, the list $rs$
shall be in the form of 
$\map \; (\lambda s'. r\backslash s' \cdot r^*) \; Ss$.
The $Ss$ is a list of strings, and for example in the sequence
closed form it is specified as $\textit{Suffix} \; s \; r_1$.
To get $Ss$ for the star regular expression,
we need to introduce $\starupdate$ and $\starupdates$:
\begin{center}
	\begin{tabular}{lcl}
		$\starupdate \; c \; r \; [] $ & $\dn$ & $[]$\\
		$\starupdate \; c \; r \; (s :: Ss)$ & $\dn$ & \\
						     & & $\textit{if} \; 
						     (\rnullable \; (r \backslash_{rs} s))$ \\
						     & & $\textit{then} \;\; (s @ [c]) :: [c] :: (
						     \starupdate \; c \; r \; Ss)$ \\
						     & & $\textit{else} \;\; (s @ [c]) :: (
						     \starupdate \; c \; r \; Ss)$
	\end{tabular}
\end{center}
\begin{center}
	\begin{tabular}{lcl}
		$\starupdates \; [] \; r \; Ss$ & $=$ & $Ss$\\
		$\starupdates \; (c :: cs) \; r \; Ss$ &  $=$ &  $\starupdates \; cs \; r \; (
		\starupdate \; c \; r \; Ss)$
	\end{tabular}
\end{center}
\noindent
Assuming we have that
\begin{center}
	$\rderssimp{r^*}{s} = \rsimp{(\sum \map \; (\lambda s'. r\backslash s' \cdot r^*) \; Ss)}$.
\end{center}
holds.
The idea of $\starupdate$ and $\starupdates$
is to update $Ss$ when another
derivative is taken on $\rderssimp{r^*}{s}$
w.r.t a character $c$ and a string $s'$
respectively.
Both $\starupdate$ and $\starupdates$ take three arguments as input:
the new character $c$ or string $s$ to take derivative with, 
the regular expression
$r$ under the star $r^*$, and the
list of strings $Ss$ for the derivative $r^* \backslash s$ 
up until this point  
such that 
\begin{center}
$(r^*) \backslash s = \sum_{s' \in sSet} (r\backslash s') \cdot r^*$ 
\end{center}
is satisfied.

Functions $\starupdate$ and $\starupdates$ characterise what the 
star derivatives will look like once ``straightened out'' into lists.
The helper functions for such operations will be similar to
$\sflat{\_}$, $\sflataux{\_}$ and $\sflataux{\_}$, which we defined for sequence.
We use similar symbols to denote them, with a $*$ subscript to mark the difference.
\begin{center}
	\begin{tabular}{lcl}
		$\hflataux{r_1 + r_2}$ & $\dn$ & $\hflataux{r_1} @ \hflataux{r_2}$\\
		$\hflataux{r}$ & $\dn$ & $[r]$
	\end{tabular}
\end{center}

\begin{center}
	\begin{tabular}{lcl}
		$\hflat{r_1 + r_2}$ & $\dn$ & $\sum (\hflataux {r_1} @ \hflataux {r_2}) $\\
		$\hflat{r}$ & $\dn$ & $r$
	\end{tabular}
\end{center}
\noindent
These definitions are tailor-made for dealing with alternatives that have
originated from a star's derivatives.
A typical star derivative always has the structure of a balanced binary tree:
\begin{center}
	$(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + 
	(r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) $ 
\end{center}
All of the nested structures of alternatives
generated from derivatives are binary, and therefore
$\hflat{\_}$ and $\hflataux{\_}$ only deal with binary alternatives.
$\hflat{\_}$ ``untangles'' like the following:
\[
	\sum ((r_1 + r_2) + (r_3 + r_4))  + \ldots \;
	\stackrel{\hflat{\_}}{\longrightarrow} \;
	\RALTS{[r_1, r_2, \ldots, r_n]} 
\]
Here is a lemma stating the recursive property of $\starupdate$ and $\starupdates$,
with the helpers $\hflat{\_}$ and $\hflataux{\_}$\footnote{The function $\textit{concat}$ takes a list of lists 
			and merges each of the element lists to form a flattened list.}:
\begin{lemma}\label{stupdateInduct1}
	\mbox
	For a list of strings $Ss$, the following hold.
	\begin{itemize}
		\item
			If we do a derivative on the terms 
			$r\backslash_r s \cdot r^*$ (where $s$ is taken from the list $Ss$),
			the result will be the same as if we apply $\starupdate$ to $Ss$.
			\begin{center}
				\begin{tabular}{c}
			$\textit{concat} \; (\map \; (\hflataux{\_} \circ ( (\_\backslash_r x)
			\circ (\lambda s.\;\; (r \backslash_r s) \cdot r^*)))\; Ss )\;
			$\\
			\\
			$=$ \\
			\\
			$\map \; (\lambda s. (r \backslash_r s) \cdot (r^*)) \; 
			(\starupdate \; x \; r \; Ss)$
				\end{tabular}
			\end{center}
		\item
			$\starupdates$ is ``composable'' w.r.t a derivative.
			It piggybacks the character $x$ to the tail of the string
			$xs$.
			\begin{center}
				\begin{tabular}{c}
					$\textit{concat} \; (\map \; \hflataux{\_} \; 
					(\map \; (\_\backslash_r x) \; 
					(\map \; (\lambda s.\;\; (r \backslash_r s) \cdot 
					(r^*) ) \; (\starupdates \; xs \; r \; Ss))))$\\
					\\
					$=$\\
					\\
					$\map \; (\lambda s.\;\; (r\backslash_r s) \cdot (r^*)) \;
					(\starupdates \; (xs @ [x]) \; r \; Ss)$
				\end{tabular}
			\end{center}
	\end{itemize}
\end{lemma}
			
\begin{proof}
	Part 1 is by induction on $Ss$.
	Part 2 is by induction on $xs$, where $Ss$ is left to take arbitrary values.
\end{proof}
			

Like $\textit{createdBySequence}$, we need
a predicate for ``star-created'' regular expressions:
\begin{center}
	\begin{mathpar}
		\inferrule{\mbox{}}{ \textit{createdByStar}\; \RSEQ{ra}{\RSTAR{rb}} }

		\inferrule{  \textit{createdByStar} \; r_1\; \land  \; \textit{createdByStar} \; r_2 }{\textit{createdByStar} \; (r_1 + r_2) } 
	\end{mathpar}
\end{center}
\noindent
All regular expressions created by taking derivatives of
$r_1 \cdot (r_2)^*$ satisfy the $\textit{createdByStar}$ predicate:
\begin{lemma}\label{starDersCbs}
	$\textit{createdByStar} \; ((r_1 \cdot r_2^*) \backslash_r s) $ holds.
\end{lemma}
\begin{proof}
	By a reverse induction on $s$.
\end{proof}
If a regular expression conforms to the shape of a star's derivative,
then we can push an application of $\hflataux{\_}$ inside a derivative of it:
\begin{lemma}\label{hfauPushin}
	If $\textit{createdByStar} \; r$ holds, then
	\begin{center}
		$\hflataux{r \backslash_r c} = \textit{concat} \; (
		\map \; \hflataux{\_} (\map \; (\_\backslash_r c) \;(\hflataux{r})))$
	\end{center}
	holds.
\end{lemma}
\begin{proof}
	By an induction on the inductivev cases of $\textit{createdByStar}$.
\end{proof}
%This is not entirely true for annotated regular expressions: 
%%TODO: bsimp bders \neq bderssimp
%\begin{center}
%	$(1+ (c\cdot \ASEQ{bs}{c^*}{c} ))$
%\end{center}
%For bit-codes, the order in which simplification is applied
%might cause a difference in the location they are placed.
%If we want something like
%\begin{center}
%	$\bderssimp{r}{s} \myequiv \bsimp{\bders{r}{s}}$
%\end{center}
%Some "canonicalization" procedure is required,
%which either pushes all the common bitcodes to nodes
%as senior as possible:
%\begin{center}
%	$_{bs}(_{bs_1 @ bs'}r_1 + _{bs_1 @ bs''}r_2) \rightarrow _{bs @ bs_1}(_{bs'}r_1 + _{bs''}r_2) $
%\end{center}
%or does the reverse. However bitcodes are not of interest if we are talking about
%the $\llbracket r \rrbracket$ size of a regex.
%Therefore for the ease and simplicity of producing a
%proof for a size bound, we are happy to restrict ourselves to 
%unannotated regular expressions, and obtain such equalities as
%TODO: rsimp sflat
% The simplification of a flattened out regular expression, provided it comes
%from the derivative of a star, is the same as the one nested.



Now we introduce an inductive property
for $\starupdate$ and $\hflataux{\_}$.
\begin{lemma}\label{starHfauInduct}
	If we do derivatives of $r^*$
	with a string that starts with $c$,
	then flatten it out,
	we obtain a list
	of the shape $\sum_{s' \in sS} (r\backslash_r s') \cdot r^*$,
	where $sS = \starupdates \; s \; r \; [[c]]$. Namely,
	\begin{center}
	$\hflataux{(( (\rder{c}{r_0})\cdot(r_0^*))\backslash_{rs} s)} = 
		\map \; (\lambda s_1. (r_0 \backslash_r s_1) \cdot (r_0^*)) \; 
		(\starupdates \; s \; r_0 \; [[c]])$
	\end{center}
holds.
\end{lemma}
\begin{proof}
	By an induction on $s$, the inductive cases
	being $[]$ and $s@[c]$. The lemmas \ref{hfauPushin} and \ref{starDersCbs} are used.
\end{proof}
\noindent

$\hflataux{\_}$ has a similar effect as $\textit{flatten}$:
\begin{lemma}\label{hflatauxGrewrites}
	$a :: rs \grewrites \hflataux{a} @ rs$
\end{lemma}
\begin{proof}
	By induction on $a$. $rs$ is set to take arbitrary values.
\end{proof}
It is also not surprising that $\textit{rsimp}$ will cover
the effect of $\hflataux{\_}$:
\begin{lemma}\label{cbsHfauRsimpeq1}
	$\rsimp{(r_1 + r_2)} = \rsimp{(\RALTS{\hflataux{r_1} @ \hflataux{r_2}})}$
\end{lemma}

\begin{proof}
	By using the rewriting relation $\rightsquigarrow$
\end{proof}
And from this we obtain a proof that a star's derivative will be the same
as if it had all its nested alternatives created during deriving being flattened out:
For example,
\begin{lemma}\label{hfauRsimpeq2}
	$\createdByStar{r} \implies \rsimp{r} = \rsimp{\RALTS{\hflataux{r}}}$
\end{lemma}
\begin{proof}
	By structural induction on $r$, where the induction rules 
	are these of $\createdByStar{\_}$.
	Lemma \ref{cbsHfauRsimpeq1} is used in the inductive case.
\end{proof}


%Here is a corollary that states the lemma in
%a more intuitive way:
%\begin{corollary}
%	$\hflataux{r^* \backslash_r (c::xs)} = \map \; (\lambda s. (r \backslash_r s) \cdot
%	(r^*))\; (\starupdates \; c\; r\; [[c]])$
%\end{corollary}
%\noindent
%Note that this is also agnostic of the simplification
%function we defined, and is therefore of more general interest.

Together with the rewriting relation
\begin{lemma}\label{starClosedForm6Hrewrites}
	We have the following set of rewriting relations or equalities:
	\begin{itemize}
		\item
			$\textit{rsimp} \; (r^* \backslash_r (c::s)) 
			\sequal
			\sum \; ( ( \sum (\lambda s. (r\backslash_r s) \cdot r^*) \; (
			\starupdates \; s \; r \; [ c::[]] ) ) )$
		\item
			$r \backslash_{rsimps} (c::s) = \textit{rsimp} \; ( (
			\sum ( (\map \; (\lambda s_1. (r\backslash s_1) \; r^*) \;
			(\starupdates \;s \; r \; [ c::[] ])))))$
		\item
			$\sum ( (\map \; (\lambda s. (r\backslash s) \; r^*) \; Ss))
			\sequal
			 \sum ( (\map \; (\lambda s. \textit{rsimp} \; (r\backslash s) \;
			 r^*) \; Ss) )$
		\item
			$\map \; (\lambda s. (\rsimp{r \backslash_r s}) \cdot (r^*)) \; Ss
			\scfrewrites
			\map \; (\lambda s. (\rsimp{r \backslash_r s}) \cdot (r^*)) \; Ss$
		\item
			$( ( \sum ( ( \map \ (\lambda s. \;\;
			(\textit{rsimp} \; (r \backslash_r s)) \cdot r^*) \; (\starupdates \;
			s \; r \; [ c::[] ])))))$\\
			$\sequal$\\
			$( ( \sum ( ( \map \ (\lambda s. \;\;
			( r \backslash_{rsimps} s)) \cdot r^*) \; (\starupdates \;
			s \; r \; [ c::[] ]))))$\\
	\end{itemize}
\end{lemma}
\begin{proof}
	Part 1 leads to part 2.
	The rest of them are routine.
\end{proof}
\noindent
Next the closed form for star regular expressions can be derived:
\begin{theorem}\label{starClosedForm}
	$\rderssimp{r^*}{c::s} = 
	\rsimp{
		(\sum (\map \; (\lambda s. (\rderssimp{r}{s})\cdot r^*) \; 
		(\starupdates \; s\; r \; [[c]])
		)
		)
	}
	$
\end{theorem}
\begin{proof}
	By an induction on $s$.
	The lemmas \ref{rsimpIdem}, \ref{starHfauInduct}, \ref{starClosedForm6Hrewrites} 
	and \ref{hfauRsimpeq2}
	are used.	
	In \ref{starClosedForm6Hrewrites}, the equalities are
	used to link the LHS and RHS.
\end{proof}






%----------------------------------------------------------------------------------------
%	SECTION ??
%----------------------------------------------------------------------------------------

%-----------------------------------
%	SECTION syntactic equivalence under simp
%-----------------------------------


%----------------------------------------------------------------------------------------
%	SECTION ALTS CLOSED FORM
%----------------------------------------------------------------------------------------
%\section{A Closed Form for \textit{ALTS}}
%Now we prove that  $rsimp (rders\_simp (RALTS rs) s) = rsimp (RALTS (map (\lambda r. rders\_simp r s) rs))$.
%
%
%There are a few key steps, one of these steps is
%
%
%
%One might want to prove this by something a simple statement like: 
%
%For this to hold we want the $\textit{distinct}$ function to pick up
%the elements before and after derivatives correctly:
%$r \in rset \equiv (rder x r) \in (rder x rset)$.
%which essentially requires that the function $\backslash$ is an injective mapping.
%
%Unfortunately the function $\backslash c$ is not an injective mapping.
%
%\subsection{function $\backslash c$ is not injective (1-to-1)}
%\begin{center}
%	The derivative $w.r.t$ character $c$ is not one-to-one.
%	Formally,
%	$\exists r_1 \;r_2. r_1 \neq r_2 \mathit{and} r_1 \backslash c = r_2 \backslash c$
%\end{center}
%This property is trivially true for the
%character regex example:
%\begin{center}
%	$r_1 = e; \; r_2 = d;\; r_1 \backslash c = \ZERO = r_2 \backslash c$
%\end{center}
%But apart from the cases where the derivative
%output is $\ZERO$, are there non-trivial results
%of derivatives which contain strings?
%The answer is yes.
%For example,
%\begin{center}
%	Let $r_1 = a^*b\;\quad r_2 = (a\cdot a^*)\cdot b + b$.\\
%	where $a$ is not nullable.\\
%	$r_1 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$\\
%	$r_2 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$
%\end{center}
%We start with two syntactically different regular expressions,
%and end up with the same derivative result.
%This is not surprising as we have such 
%equality as below in the style of Arden's lemma:\\
%\begin{center}
%	$L(A^*B) = L(A\cdot A^* \cdot B + B)$
%\end{center}
\section{Bounding Closed Forms}

In this section, we introduce how we formalised the bound
on closed forms.
We first show that in general regular expressions up to a certain 
size are finite.
Then we prove that functions such as $\rflts$
will not cause the size of r-regular expressions to grow.
Putting this together with a general bound 
on the finiteness of distinct regular expressions
up to a certain size, we obtain a bound on 
the closed forms.

\subsection{Finiteness of Distinct Regular Expressions}
We define the set of regular expressions whose size are no more than
a certain size $N$ as $\textit{sizeNregex} \; N$:
\[
	\textit{sizeNregex} \; N \dn \{r\; \mid \;  \llbracket r \rrbracket_r \leq N \}
\]
We have that $\textit{sizeNregex} \; N$ is always a finite set:
\begin{lemma}\label{finiteSizeN}
	$\textit{finite} \; (\textit{sizeNregex} \; N)$ holds.
\end{lemma}
\begin{proof}
	By splitting the set $\textit{sizeNregex} \; (N + 1)$ into
	subsets by their categories:
	$\{\ZERO, \ONE, c\}$, $\{* `` \textit{sizeNregex} \; N\}$,
	and so on. Each of these subsets are finitely bounded.
\end{proof}
\noindent
From this we get a corollary that
if forall $r \in rs$, $\rsize{r} \leq N$, then the output of 
$\rdistinct{rs}{\varnothing}$ is a list of regular
expressions of finite size depending on $N$ only. 
\begin{corollary}\label{finiteSizeNCorollary}
	$\rsize{\rdistinct{rs}{\varnothing}} \leq c_N * N$ holds,
	where the constant $c_N$ is equal to $\textit{card} \; (\textit{sizeNregex} \;
	N)$.
\end{corollary}
\begin{proof}
	For all $r$ in 
	$\textit{set} \; (\rdistinct{rs}{\varnothing})$,
	it is always the case that $\rsize{r} \leq N$.
	In addition, the list length is bounded by
	$c_N$, yielding the desired bound.
\end{proof}
\noindent
This fact will be handy in estimating the closed form sizes.
%We have proven that the size of the
%output of $\textit{rdistinct} \; rs' \; \varnothing$
%is bounded by a constant $N * c_N$ depending only on $N$,
%provided that each of $rs'$'s element
%is bounded by $N$.

\subsection{$\textit{rsimp}$ Does Not Increment the Size}
Although it seems evident, we need a series
of non-trivial lemmas to establish that functions such as $\rflts$
do not cause the regular expressions to grow.
\begin{lemma}\label{rsimpMonoLemmas}
	\mbox{}
	\begin{itemize}
		\item
			\[
				\llbracket \rsimpalts \; rs \rrbracket_r \leq
				\llbracket \sum \; rs \rrbracket_r
			\]
		\item
			\[
				\llbracket \rsimpseq \; r_1 \;  r_2 \rrbracket_r \leq
				\llbracket r_1 \cdot r_2 \rrbracket_r
			\]
		\item
			\[
				\llbracket \rflts \; rs \rrbracket_r  \leq
				\llbracket rs \rrbracket_r 
			\]
		\item
			\[
				\llbracket \rDistinct \; rs \; ss \rrbracket_r  \leq
				\llbracket rs \rrbracket_r 
			\]
		\item
			If all elements $a$ in the set $as$ satisfy the property
			that $\llbracket \textit{rsimp} \; a \rrbracket_r \leq
			\llbracket a \rrbracket_r$, then we have 
			\[
				\llbracket \; \rsimpalts \; (\textit{rdistinct} \;
				(\textit{rflts} \; (\textit{map}\;\textit{rsimp} as)) \{\})
				\rrbracket \leq
				\llbracket \; \sum \; (\rDistinct \; (\rflts \;(\map \;
				\textit{rsimp} \; x))\; \{ \} ) \rrbracket_r 
			\]
	\end{itemize}
\end{lemma}
\begin{proof}
	Point 1, 3, 4 can be proven by an induction on $rs$.
	Point 2 is by case analysis on $r_1$ and $r_2$.
	The last part is a corollary of the previous ones.
\end{proof}
\noindent
With the lemmas for each inductive case in place, we are ready to get 
the non-increasing property as a corollary:
\begin{corollary}\label{rsimpMono}
	$\llbracket \textit{rsimp} \; r \rrbracket_r \leq \llbracket r \rrbracket_r$
\end{corollary}
\begin{proof}
	By \ref{rsimpMonoLemmas}.
\end{proof}

\subsection{Estimating the Closed Forms' sizes}
We recap the closed forms we obtained
earlier by putting them together down below:
\begin{itemize}
	\item
		$\rderssimp{(\sum rs)}{s} \sequal
		\sum \; (\map \; (\rderssimp{\_}{s}) \; rs)$
	\item
		$\rderssimp{(r_1 \cdot r_2)}{s} \sequal \sum ((r_1 \backslash s) \cdot r_2 ) 
		:: (\map \; (r_2 \backslash \_) (\vsuf{s}{r_1}))$
	\item

		$\rderssimp{r^*}{c::s} = 
		\rsimp{
			(\sum (\map \; (\lambda s. (\rderssimp{r}{s})\cdot r^*) \; 
			(\starupdates \; s\; r \; [[c]])
			)
			)
		}
		$
\end{itemize}	
\noindent	
The closed forms on the left-hand-side
are all of the same shape: $\rsimp{ (\sum rs)} $.
Such regular expression will be bounded by the size of $\sum rs'$, 
where every element in $rs'$ is distinct, and each element 
can be described by some inductive sub-structures 
(for example when $r = r_1 \cdot r_2$ then $rs'$ 
will be solely comprised of $r_1 \backslash s'$ 
and $r_2 \backslash s''$, $s'$ and $s''$ being 
sub-strings of $s$).
which will each have a size uppder bound 
according to inductive hypothesis, which controls $r \backslash s$.

We elaborate the above reasoning by a series of lemmas
below, where straightforward proofs are omitted.




We want to apply it to our setting $\rsize{\rsimp{\sum rs}}$.

We show that $\rdistinct$ and $\rflts$
working together is at least as 
good as $\rdistinct{}{}$ alone, which can be written as
\begin{center}
	$\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r 
	\leq 
	\llbracket \rdistinct{rs}{\varnothing}  \rrbracket_r  $.
\end{center}
We need this so that we know the outcome of our real 
simplification is better than or equal to a rough estimate,
and therefore can be bounded by that estimate.
This is a bit harder to establish compared with proving
$\textit{flts}$ does not make a list larger (which can
be proven using routine induction):
\begin{center}
	$\llbracket  \textit{rflts}\; rs \rrbracket_r \leq
	\llbracket  \textit{rs} \rrbracket_r$
\end{center}
We cannot simply prove how each helper function
reduces the size and then put them together:
From
\begin{center}
$\llbracket  \textit{rflts}\; rs \rrbracket_r \leq
	\llbracket  \textit{rs} \rrbracket_r$
\end{center}
and
\begin{center}
     $\llbracket  \textit{rdistinct} \; rs \; \varnothing \leq
     \llbracket rs \rrbracket_r$
\end{center}
one cannot infer 
\begin{center}
	$\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r 
	\leq 
	\llbracket \rdistinct{rs}{\varnothing}  \rrbracket_r  $.
\end{center}
What we can infer is that 
\begin{center}
	$\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r 
	\leq
	\llbracket rs \rrbracket_r$
\end{center}
but this estimate is too rough and $\llbracket rs \rrbracket_r$	is unbounded.
The way we 
get around this is by first proving a more general lemma 
(so that the inductive case goes through):
\begin{lemma}\label{fltsSizeReductionAlts}
	If we have three accumulator sets:
	$noalts\_set$, $alts\_set$ and $corr\_set$,
	satisfying:
	\begin{itemize}
		\item
			$\forall r \in noalts\_set. \; \nexists xs.\; r = \sum  xs$
		\item
			$\forall r \in alts\_set. \; \exists xs. \; r = \sum xs
			\; \textit{and} \; set \; xs \subseteq corr\_set$
	\end{itemize}
	then we have that
	\begin{center}
	\begin{tabular}{lcl}
	$\llbracket  (\textit{rdistinct} \; (\textit{rflts} \; as) \;
	(noalts\_set \cup corr\_set)) \rrbracket_r$ & $\leq$ &\\
						    $\llbracket  (\textit{rdistinct} \; as \; (noalts\_set \cup alts\_set \cup
	\{ \ZERO \} )) \rrbracket_r$ & & \\ 
	\end{tabular}
	\end{center}
		holds.
\end{lemma}
\noindent
We split the accumulator into two parts: the part
which contains alternative regular expressions ($alts\_set$), and 
the part without any of them($noalts\_set$).
This is because $\rflts$ opens up the alternatives in $as$,
causing the accumulators on both sides of the inequality
to diverge slightly.
If we want to compare the accumulators that are not
perfectly in sync, we need to consider the alternatives and non-alternatives
separately.
The set $corr\_set$ is the corresponding set
of $alts\_set$ with all elements under the alternative constructor
spilled out.
\begin{proof}
	By induction on the list $as$. We make use of lemma \ref{rdistinctConcat}.
\end{proof}
By setting all three sets to the empty set, one gets the desired size estimate:
\begin{corollary}\label{interactionFltsDB}
	$\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r 
	\leq 
	\llbracket \rdistinct{rs}{\varnothing}  \rrbracket_r  $.
\end{corollary}
\begin{proof}
	By using the lemma \ref{fltsSizeReductionAlts}.
\end{proof}
\noindent
The intuition for why this is true
is that if we remove duplicates from the $\textit{LHS}$, at least the same amount of 
duplicates will be removed from the list $\textit{rs}$ in the $\textit{RHS}$. 

Now this $\rsimp{\sum rs}$ can be estimated using $\rdistinct{rs}{\varnothing}$:
\begin{lemma}\label{altsSimpControl}
	$\rsize{\rsimp{\sum rs}} \leq \rsize{\rdistinct{rs}{\varnothing}}+ 1$
\end{lemma}
\begin{proof}
	By using corollary \ref{interactionFltsDB}.
\end{proof}
\noindent
This is a key lemma in establishing the bounds on all the 
closed forms.
With this we are now ready to control the sizes of
$(r_1 \cdot r_2 )\backslash s$ and $r^* \backslash s$.
\begin{theorem}\label{rBound}
	For any regex $r$, $\exists N_r. \forall s. \; \rsize{\rderssimp{r}{s}} \leq N_r$
\end{theorem}
\noindent
\begin{proof}
	We prove this by induction on $r$. The base cases for $\RZERO$,
	$\RONE $ and $\RCHAR{c}$ are straightforward. 
	In the sequence $r_1 \cdot r_2$ case,
	the inductive hypotheses state 
	$\exists N_1. \forall s. \; \llbracket \rderssimp{r}{s} \rrbracket \leq N_1$ and
	$\exists N_2. \forall s. \; \llbracket \rderssimp{r_2}{s} \rrbracket \leq N_2$. 

	When the string $s$ is not empty, we can reason as follows
	%
	\begin{center}
		\begin{tabular}{lcll}
& & $ \llbracket   \rderssimp{r_1\cdot r_2 }{s} \rrbracket_r $\\
& $ = $ & $\llbracket \rsimp{(\sum(r_1 \backslash_{rsimps} s \cdot r_2 \; \;  :: \; \; 
		\map \; (r_2\backslash_{rsimps} \_)\; (\vsuf{s}{r})))} \rrbracket_r $ & (1) \\			
										     & $\leq$ & $\llbracket \rdistinct{(r_1 \backslash_{rsimps} s \cdot r_2 \; \;  :: \; \; 
	\map \; (r_2\backslash_{rsimps} \_)\; (\vsuf{s}{r}))}{\varnothing} \rrbracket_r  + 1$ & (2) \\
											     & $\leq$ & $2 + N_1 + \rsize{r_2} + (N_2 * (card\;(\sizeNregex \; N_2)))$ & (3)\\
\end{tabular}
\end{center}
\noindent
(1) is by theorem \ref{seqClosedForm}.
(2) is by \ref{altsSimpControl}.
(3) is by \ref{finiteSizeNCorollary}.


Combining the cases when $s = []$ and $s \neq []$, we get (4):
\begin{center}
	\begin{tabular}{lcll}
		$\rsize{(r_1 \cdot r_2) \backslash_r s}$ & $\leq$ & 
		$max \; (2 + N_1 + 
		\llbracket r_2 \rrbracket_r + 
		N_2 * (card\; (\sizeNregex \; N_2))) \; \rsize{r_1\cdot r_2}$ & (4)
	\end{tabular}
\end{center}

We reason similarly for  $\STAR$.
The inductive hypothesis is
$\exists N. \forall s. \; \llbracket \rderssimp{r}{s} \rrbracket \leq N$.
Let $n_r = \llbracket r^* \rrbracket_r$.
When $s = c :: cs$ is not empty,
\begin{center}
	\begin{tabular}{lcll}
& & $ \llbracket   \rderssimp{r^* }{c::cs} \rrbracket_r $\\
& $ = $ & $\llbracket \rsimp{(\sum (\map \; (\lambda s. (r \backslash_{rsimps} s) \cdot r^*) \; (\starupdates\; 
	cs \; r \; [[c]] )) )} \rrbracket_r $ & (5) \\			
					      & $\leq$ & $\llbracket 
					      \rdistinct{
						      (\map \; 
						      (\lambda s. (r \backslash_{rsimps} s) \cdot r^*) \; 
						      (\starupdates\; cs \; r \; [[c]] )
					      )}
	{\varnothing} \rrbracket_r  + 1$ & (6) \\
					 & $\leq$ & $1 + (\textit{card} (\sizeNregex \; (N + n_r)))
	* (1 + (N + n_r)) $ & (7)\\
\end{tabular}
\end{center}
\noindent
(5) is by theorem \ref{starClosedForm}.
(6) is by \ref{altsSimpControl}.
(7) is by corollary \ref{finiteSizeNCorollary}.
Combining with the case when $s = []$, one gets
\begin{center}
	\begin{tabular}{lcll}
		$\rsize{r^* \backslash_r s}$ & $\leq$ & $max \; n_r \; 1 + (\textit{card} (\sizeNregex \; (N + n_r)))
		* (1 + (N + n_r)) $ & (8)\\
	\end{tabular}
\end{center}
\noindent

The alternative case is slightly less involved.
The inductive hypothesis 
is equivalent to $\exists N. \forall r \in (\map \; (\_ \backslash_r s) \; rs). \rsize{r} \leq N$.
In the case when $s = c::cs$, we have 
\begin{center}
	\begin{tabular}{lcll}
& & $ \llbracket   \rderssimp{\sum rs }{c::cs} \rrbracket_r $\\
& $ = $ & $\llbracket \rsimp{(\sum (\map \; (\_ \backslash_{rsimps} s)  \; rs) )} \rrbracket_r $ & (9) \\			
& $\leq$ & $\llbracket (\sum (\map \; (\_ \backslash_{rsimps} s)  \; rs) ) \rrbracket_r $  & (10) \\
& $\leq$ & $1 + N * (length \; rs) $ & (11)\\
	\end{tabular}
\end{center}
\noindent
(9) is by theorem \ref{altsClosedForm}, (10) by lemma \ref{rsimpMono} and (11) by inductive hypothesis.

Combining with the case when $s = []$, one gets
\begin{center}
	\begin{tabular}{lcll}
		$\rsize{\sum rs \backslash_r s}$ & $\leq$ & $max \; \rsize{\sum rs} \; 1+N*(length \; rs)$ 
						 & (12)\\
	\end{tabular}
\end{center}
We have all the inductive cases proven.
\end{proof}

This leads to our main result on the size bound:
\begin{corollary}
	For any annotated regular expression $a$, $\exists N_r. \forall s. \; \rsize{\bderssimp{a}{s}} \leq N_r$
\end{corollary}
\begin{proof}
	By lemma \ref{sizeRelations} and theorem \ref{rBound}.
\end{proof}
\noindent





%-----------------------------------
%	SECTION 2
%-----------------------------------


%----------------------------------------------------------------------------------------
%	SECTION 3
%----------------------------------------------------------------------------------------


\section{Comments and Future Improvements}
\subsection{Some Experimental Results}
What guarantee does this bound give us?
Whatever the regex is, it will not grow indefinitely.
Take our previous example $(a + aa)^*$ as an example:
\begin{center}
	\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
		\begin{tikzpicture}
			\begin{axis}[
				xlabel={number of $a$'s},
				x label style={at={(1.05,-0.05)}},
				ylabel={regex size},
				enlargelimits=false,
				xtick={0,5,...,30},
				xmax=33,
				ymax= 40,
				ytick={0,10,...,40},
				scaled ticks=false,
				axis lines=left,
				width=5cm,
				height=4cm, 
				legend entries={$(a + aa)^*$},  
				legend pos=south east,
				legend cell align=left]
				\addplot[red,mark=*, mark options={fill=white}] table {a_aa_star.data};
			\end{axis}
		\end{tikzpicture}
	\end{tabular}
\end{center}
We are able to limit the size of the regex $(a + aa)^*$'s derivatives
with our simplification
rules very effectively.


In our proof for the inductive case $r_1 \cdot r_2$, the dominant term in the bound
is $l_{N_2} * N_2$, where $N_2$ is the bound we have for $\llbracket \bderssimp{r_2}{s} \rrbracket$.
Given that $l_{N_2}$ is roughly the size $4^{N_2}$, the size bound $\llbracket \bderssimp{r_1 \cdot r_2}{s} \rrbracket$
inflates the size bound of $\llbracket \bderssimp{r_2}{s} \rrbracket$ with the function
$f(x) = x * 2^x$.
This means the bound we have will surge up at least
tower-exponentially with a linear increase of the depth.

One might be quite skeptical about what this non-elementary
bound can bring us.
It turns out that the giant bounds are far from being hit.
Here we have some test data from randomly generated regular expressions:
\begin{figure}[H]
	\begin{tabular}{@{}c@{\hspace{2mm}}c@{\hspace{0mm}}c@{}}
		\begin{tikzpicture}
			\begin{axis}[
				xlabel={$n$},
				x label style={at={(1.05,-0.05)}},
				ylabel={regex size},
				enlargelimits=false,
				xtick={0,5,...,30},
				xmax=33,
				%ymax=1000,
				%ytick={0,100,...,1000},
				scaled ticks=false,
				axis lines=left,
				width=4.75cm,
				height=3.8cm, 
				legend entries={regex1},  
				legend pos=north east,
				legend cell align=left]
				\addplot[red,mark=*, mark options={fill=white}] table {regex1_size_change.data};
			\end{axis}
		\end{tikzpicture}
 & 
  \begin{tikzpicture}
	  \begin{axis}[
		  xlabel={$n$},
		  x label style={at={(1.05,-0.05)}},
		  %ylabel={time in secs},
		  enlargelimits=false,
		  xtick={0,5,...,30},
		  xmax=33,
		  %ymax=1000,
		  %ytick={0,100,...,1000},
		  scaled ticks=false,
		  axis lines=left,
		  width=4.75cm,
		  height=3.8cm, 
		  legend entries={regex2},  
		  legend pos=south east,
		  legend cell align=left]
		  \addplot[blue,mark=*, mark options={fill=white}] table {regex2_size_change.data};
	  \end{axis}
  \end{tikzpicture}
 & 
  \begin{tikzpicture}
	  \begin{axis}[
		  xlabel={$n$},
		  x label style={at={(1.05,-0.05)}},
		  %ylabel={time in secs},
		  enlargelimits=false,
		  xtick={0,5,...,30},
		  xmax=33,
		  %ymax=1000,
		  %ytick={0,100,...,1000},
		  scaled ticks=false,
		  axis lines=left,
		  width=4.75cm,
		  height=3.8cm, 
		  legend entries={regex3},  
		  legend pos=south east,
		  legend cell align=left]
		  \addplot[cyan,mark=*, mark options={fill=white}] table {regex3_size_change.data};
	  \end{axis}
  \end{tikzpicture}\\
  \multicolumn{3}{c}{}
	\end{tabular}    
  \caption{Graphs: size change of 3 randomly generated 
  regular expressions $w.r.t.$ input string length. 
  The x axis represents the length of input.}
\end{figure}  
\noindent
Most of the regex's sizes seem to stay within a polynomial bound $w.r.t$ the 
original size.
We will discuss improvements to this bound in the next chapter.



\subsection{Possible Further Improvements}
There are two problems with this finiteness result, though.\\
(i)	
		First, it is not yet a direct formalisation of our lexer's complexity,
		as a complexity proof would require looking into 
		the time it takes to execute {\bf all} the operations
		involved in the lexer (simp, collect, decode), not just the derivative.\\
(ii)
		Second, the bound is not yet tight, and we seek to improve $N_a$ so that
		it is polynomial on $\llbracket a \rrbracket$.\\
Still, we believe this contribution is useful,
because
\begin{itemize}
	\item

		The size proof can serve as a starting point for a complexity
		formalisation.
		Derivatives are the most important phases of our lexer algorithm.
		Size properties about derivatives covers the majority of the algorithm
		and is therefore a good indication of complexity of the entire program.
	\item
		The bound is already a strong indication that catastrophic
		backtracking is much less likely to occur in our $\blexersimp$
		algorithm.
		We refine $\blexersimp$ with $\blexerStrong$ in the next chapter
		so that the bound becomes polynomial.
\end{itemize}

%----------------------------------------------------------------------------------------
%	SECTION 4
%----------------------------------------------------------------------------------------








One might wonder the actual bound rather than the loose bound we gave
for the convenience of an easier proof.
How much can the regex $r^* \backslash s$ grow? 
As  earlier graphs have shown,
%TODO: reference that graph where size grows quickly
they can grow at a maximum speed
exponential $w.r.t$ the number of characters, 
but will eventually level off when the string $s$ is long enough.
If they grow to a size exponential $w.r.t$ the original regex, our algorithm
would still be slow.
And unfortunately, we have concrete examples
where such regular expressions grew exponentially large before levelling off:
\begin{center}
$(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots + 
(\underbrace{a \ldots a}_{\text{n a's}})^*)^*$ 
\end{center}
will already have a maximum
size that is  exponential on the number $n$ 
under our current simplification rules:
%TODO: graph of a regex whose size increases exponentially.
\begin{center}
	\begin{tikzpicture}
		\begin{axis}[
			height=0.5\textwidth,
			width=\textwidth,
			xlabel=number of a's,
			xtick={0,...,9},
			ylabel=maximum size,
			ymode=log,
			log basis y={2}
			]
			\addplot[mark=*,blue] table {re-chengsong.data};
		\end{axis}
	\end{tikzpicture}
\end{center}

For convenience we use $(\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$
to express $(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots + 
(\underbrace{a \ldots a}_{\text{n a's}})^*$ in the below discussion.
The exponential size is triggered by that the regex
$\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$
inside the $(\ldots) ^*$ having exponentially many
different derivatives, despite those difference being minor.
$(\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*\backslash \underbrace{a \ldots a}_{\text{m a's}}$
will therefore contain the following terms (after flattening out all nested 
alternatives):
\begin{center}
$(\sum_{i = 1}^{n}  (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot  (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)$\\
[1mm]
	$(1 \leq m' \leq m )$
\end{center}
There at at least exponentially
many such terms.\footnote{To be exact, these terms are 
distinct for $m' \leq L.C.M.(1, \ldots, n)$, the details are omitted,
but the point is that the number is exponential.
} 
With each new input character taking the derivative against the intermediate result, more and more such distinct
terms will accumulate.
The function $\textit{distinctBy}$ will not be able to de-duplicate any two of these terms 
\begin{center}
$(\sum_{i = 1}^{n}  
(\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot  
(\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot 
(\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\
$(\sum_{i = 1}^{n}  (\underbrace{a \ldots a}_{\text{((i - (m'' \% i))\%i) a's}})\cdot  
(\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot 
(\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$
\end{center}
\noindent
where $m' \neq m''$
as they are slightly different.
This means that with our current simplification methods,
we will not be able to control the derivative so that
$\llbracket \bderssimp{r}{s} \rrbracket$ stays polynomial. %\leq O((\llbracket r\rrbacket)^c)$
These terms are similar in the sense that the head of those terms
are all consisted of sub-terms of the form: 
$(\underbrace{a \ldots a}_{\text{j a's}})\cdot  (\underbrace{a \ldots a}_{\text{i a's}})^* $.
For  $\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$, there will be at most
$n * (n + 1) / 2$ such terms. 
For example, $(a^* + (aa)^* + (aaa)^*) ^*$'s derivatives
can be described by 6 terms:
$a^*$, $a\cdot (aa)^*$, $ (aa)^*$, 
$aa \cdot (aaa)^*$, $a \cdot (aaa)^*$, and $(aaa)^*$.
The total number of different "head terms",  $n * (n + 1) / 2$,
is proportional to the number of characters in the regex 
$(\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$.
If we can improve our deduplication process so that it becomes smarter
and only keep track of these $n * (n+1) /2$ terms, then we can keep
the size growth polynomial again.
This example also suggests a slightly different notion of size, which we call the 
alphabetic width:
\begin{center}
	\begin{tabular}{lcl}
		$\textit{awidth} \; \ZERO$ & $\dn$ & $0$\\
		$\textit{awidth} \; \ONE$ & $\dn$ & $0$\\
		$\textit{awidth} \; c$ & $\dn$ & $1$\\
		$\textit{awidth} \; r_1 + r_2$ & $\dn$ & $\textit{awidth} \; 
		r_1 + \textit{awidth} \; r_2$\\
		$\textit{awidth} \; r_1 \cdot r_2$ & $\dn$ & $\textit{awidth} \;
		r_1 + \textit{awidth} \; r_2$\\
		$\textit{awidth} \; r^*$ & $\dn$ & $\textit{awidth} \; r$\\
	\end{tabular}
\end{center}



Antimirov\parencite{Antimirov95} has proven that 
$\textit{PDER}_{UNIV}(r) \leq \textit{awidth}(r)$,
where $\textit{PDER}_{UNIV}(r)$ is a set of all possible subterms
created by doing derivatives of $r$ against all possible strings.
If we can make sure that at any moment in our lexing algorithm our 
intermediate result hold at most one copy of each of the 
subterms then we can get the same bound as Antimirov's.
This leads to the algorithm in the next chapter.





%----------------------------------------------------------------------------------------
%	SECTION 1
%----------------------------------------------------------------------------------------


%-----------------------------------
%	SUBSECTION 1
%-----------------------------------
%\subsection{Syntactic Equivalence Under $\simp$}
%We prove that minor differences can be annhilated
%by $\simp$.
%For example,
%\begin{center}
%	$\simp \;(\simpALTs\; (\map \;(\_\backslash \; x)\; (\distinct \; \mathit{rs}\; \phi))) = 
%	\simp \;(\simpALTs \;(\distinct \;(\map \;(\_ \backslash\; x) \; \mathit{rs}) \; \phi))$
%\end{center}