diff -r 37b6fd310a16 -r 61139fdddae0 ChengsongTanPhdThesis/Chapters/Finite.tex
--- a/ChengsongTanPhdThesis/Chapters/Finite.tex	Sat Oct 01 12:06:46 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Finite.tex	Mon Oct 03 02:08:49 2022 +0100
@@ -402,7 +402,8 @@
 \end{center}
 \noindent
 We do not define an r-regular expression version of $\blexersimp$,
-as our proof does not involve its use. 
+as our proof does not involve its use 
+(and there is no bitcode to decode into a lexical value). 
 Everything about the size of annotated regular expressions
 can be calculated via the size of r-regular expressions:
 \begin{lemma}\label{sizeRelations}
@@ -433,7 +434,7 @@
 \end{center}
 Unless stated otherwise in the rest of this 
 chapter all regular expressions without
-bitcodes are seen as $\rrexp$s.
+bitcodes are seen as r-regular expressions ($\rrexp$s).
 For the binary alternative r-regular expression $\RALTS{[r_1, r_2]}$,
 we use the notation $r_1 + r_2$
 for brevity.
@@ -476,10 +477,10 @@
 %\end{quote}
 %\noindent
 %We explain in detail how we reached those claims.
-\subsection{Basic Properties needed for Closed Forms}
+\subsection{The Idea Behind Closed Forms}
 If we attempt to prove 
 \begin{center}
-	$\forall r. \; \exists N_r.\;\; s.t. \llbracket r\backslash_{bsimps} s \rrbracket_r \leq N_r$
+	$\forall r. \; \exists N_r.\;\; s.t. \llbracket r\backslash_{rsimps} s \rrbracket_r \leq N_r$
 \end{center}
 using a naive induction on the structure of $r$,
 then we are stuck at the inductive cases such as
@@ -487,51 +488,89 @@
 The inductive hypotheses are:
 \begin{center}
 	1: $\text{for } r_1, \text{there exists } N_{r_1}.\;\; s.t. 
-	\;\;\forall s.  \llbracket r_1 \backslash_{bsimps} s \rrbracket_r \leq N_{r_1}. $\\
+	\;\;\forall s.  \llbracket r_1 \backslash_{rsimps} s \rrbracket_r \leq N_{r_1}. $\\
 	2: $\text{for } r_2, \text{there exists } N_{r_2}.\;\; s.t. 
-	\;\; \forall s. \llbracket r_2 \backslash_{bsimps} s \rrbracket_r \leq N_{r_2}. $
+	\;\; \forall s. \llbracket r_2 \backslash_{rsimps} s \rrbracket_r \leq N_{r_2}. $
 \end{center}
 The inductive step to prove would be 
 \begin{center}
 	$\text{there exists } N_{r_1\cdot r_2}. \;\; s.t. \forall s. 
-	\llbracket (r_1 \cdot r_2) \backslash_{bsimps} s \rrbracket_r \leq N_{r_1\cdot r_2}.$
+	\llbracket (r_1 \cdot r_2) \backslash_{rsimps} s \rrbracket_r \leq N_{r_1\cdot r_2}.$
 \end{center}
 The problem is that it is not clear what 
-$(r_1\cdot r_2) \backslash_{bsimps} s$ looks like,
+$(r_1\cdot r_2) \backslash_{rsimps} s$ looks like,
 and therefore $N_{r_1}$ and $N_{r_2}$ in the
 inductive hypotheses cannot be directly used.
+We have already seen that $(r_1 \cdot r_2)\backslash s$ 
+and $(r^*)\backslash s$ can grow in a wild way.
+The point is that they will be equivalent to a list of
+terms $\sum rs$, where each term in $rs$ will
+be made of $r_1 \backslash s' $, $r_2\backslash s'$,
+and $r \backslash s'$ with $s' \in \textit{SubString} \; s$.
+The list $\sum rs$ will then be de-duplicated by $\textit{rdistinct}$
+in the simplification which saves $rs$ from growing indefinitely.
+
+Based on this idea, we sketch a proof by first showing the equality (where
+$f$ and $g$ are functions that do not increase the size of the input)
+\begin{center}
+$(r_1 \cdot r_2)\backslash_{rsimps} s = f\; (\textit{rdistinct} \; (g\; \sum rs))$,
+\end{center}
+and then show the right-hand-side can be finitely bounded.
+We call the right-hand-side the 
+\emph{Closed Form} of $(r_1 \cdot r_2)\backslash_{rsimps} s$.
+We will flesh out the proof sketch in the next section.
+
+\section{Details of Closed Forms and Bounds}
+In this section we introduce in detail
+how the closed forms are obtained for regular expressions'
+derivatives and how they are bounded.
+We start by proving some basic identities
+involving the simplification functions for r-regular expressions.
+After that we use these identities to establish the
+closed forms we need.
+Finally, we prove the functions such as $\flts$
+will keep the size non-increasing.
+Putting this together with a general bound 
+on the finiteness of distinct regular expressions
+smaller than a certain size, we obtain a bound on 
+the closed forms.
+%$r_1\cdot r_2$, $r^*$ and $\sum rs$.
 
 
-The next steps involves getting a closed form for
-$\rderssimp{r}{s}$ and then obtaining
-an estimate for the closed form.
+
+\subsection{Some Basic Identities}
+
+
 \subsubsection{$\textit{rdistinct}$'s Deduplicates Successfully}
 The $\textit{rdistinct}$ function, as its name suggests, will
-remove duplicates in an \emph{r}$\textit{rexp}$ list, 
-according to the accumulator
-and leave only one of each different element in a list:
+remove duplicates in an r-regular expression list.
+It will also correctly exclude any elements that 
+is intially in the accumulator set.
 \begin{lemma}\label{rdistinctDoesTheJob}
-	The function $\textit{rdistinct}$ satisfies the following
-	properties:
+	%The function $\textit{rdistinct}$ satisfies the following
+	%properties:
+	Assume we have the predicate $\textit{isDistinct}$\footnote{We omit its
+	recursive definition here, its Isabelle counterpart would be $\textit{distinct}$.} 
+	readily defined
+	for testing
+	whether a list's elements are all unique. Then the following
+	properties about $\textit{rdistinct}$ hold:
 	\begin{itemize}
 		\item
 			If $a \in acc$ then $a \notin (\rdistinct{rs}{acc})$.
 		\item
-			If list $rs'$ is the result of $\rdistinct{rs}{acc}$,
-			then $\textit{isDistinct} \; rs'$.
+			%If list $rs'$ is the result of $\rdistinct{rs}{acc}$,
+			$\textit{isDistinct} \;\;\; (\rdistinct{rs}{acc})$.
 		\item
-			$\rdistinct{rs}{acc} = rs - acc$
+			$\textit{set} \; (\rdistinct{rs}{acc}) 
+			= (textit{set} \; rs) - acc$
 	\end{itemize}
 \end{lemma}
 \noindent
-The predicate $\textit{isDistinct}$ is for testing
-whether a list's elements are all unique. It is defined
-recursively on the structure of a regular expression,
-and we omit the precise definition here.
 \begin{proof}
 	The first part is by an induction on $rs$.
 	The second and third part can be proven by using the 
-	induction rules of $\rdistinct{\_}{\_}$.
+	inductive cases of $\textit{rdistinct}$.
 
 \end{proof}
 
@@ -541,7 +580,7 @@
 list $rs$ whose elements are all from the accumulator, and then call $\rdistinct{\_}{\_}$
 on the resulting list, the output will be as if we had called $\rdistinct{\_}{\_}$
 without the prepending of $rs$:
-\begin{lemma}
+\begin{lemma}\label{rdistinctConcat}
 	The elements appearing in the accumulator will always be removed.
 	More precisely,
 	\begin{itemize}
@@ -549,13 +588,13 @@
 			If $rs \subseteq rset$, then 
 			$\rdistinct{rs@rsa }{acc} = \rdistinct{rsa }{acc}$.
 		\item
-			Furthermore, if $a \in rset$ and $\rdistinct{rs}{\{a\}} = []$,
+			More generally, if $a \in rset$ and $\rdistinct{rs}{\{a\}} = []$,
 			then $\rdistinct{(rs @ rs')}{rset} = \rdistinct{rs'}{rset}$
 	\end{itemize}
 \end{lemma}
 
 \begin{proof}
-	By induction on $rs$.
+	By induction on $rs$ and using \ref{rdistinctDoesTheJob}.
 \end{proof}
 \noindent
 On the other hand, if an element $r$ does not appear in the input list waiting to be deduplicated,
@@ -964,9 +1003,10 @@
 	By induction on the lists involved.
 \end{proof}
 \noindent
-Similarly,
-we introduce the equality for $\sum$ when certain child regular expressions
-are $\sum$ themselves:
+The above allows us to prove
+two similar equalities (which are a bit more involved).
+It says that we could flatten out the elements
+before simplification and still get the same result.
 \begin{lemma}\label{simpFlatten3}
 	One can flatten the inside $\sum$ of a $\sum$ if it is being 
 	simplified. Concretely,
@@ -1279,7 +1319,11 @@
 Before we go on to obtain them, some preliminary definitions
 are needed to make proof statements concise.
 
-\section{"Closed Forms" of Sequence Regular Expressions}
+
+
+
+\subsection{Closed Forms}
+\subsubsection{Closed Form for Sequence Regular Expressions}
 The problem of obataining a closed-form for sequence regular expression 
 is constructing $(r_1 \cdot r_2) \backslash_r s$
 if we are only allowed to use a combination of $r_1 \backslash s''$ 
@@ -1466,7 +1510,7 @@
 	\end{center}
 \end{corollary}
 \noindent
-\subsection{Closed Forms for Star Regular Expressions}
+\subsubsection{Closed Forms for Star Regular Expressions}
 We have shown how to control the size of the sequence regular expression $r_1\cdot r_2$ using
 the "closed form" of $(r_1 \cdot r_2) \backslash s$ and then 
 the property of the $\distinct$ function.
@@ -1683,7 +1727,13 @@
 	The lemmas \ref{rsimpIdem}, \ref{starHfauInduct}, and \ref{hfauRsimpeq2}
 	are used.	
 \end{proof}
-\section{Estimating the Closed Forms' sizes}
+
+
+
+
+
+
+\subsection{Estimating the Closed Forms' sizes}
 We now summarize the closed forms below:
 \begin{itemize}
 	\item
@@ -1758,15 +1808,94 @@
 is bounded by $N$.
 We want to apply it to our setting $\rsize{\rsimp{\sum rs}}$.
 
-We show how $\rdistinct$ and $\rflts$
-in the simplification function together is at least as 
-good as $\rdistinct{}{}$ alone.
-\begin{lemma}\label{interactionFltsDB}
+We show that $\rdistinct$ and $\rflts$
+working together is at least as 
+good as $\rdistinct{}{}$ alone, which can be written as
+\begin{center}
+	$\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r 
+	\leq 
+	\llbracket \rdistinct{rs}{\varnothing}  \rrbracket_r  $.
+\end{center}
+We need this so that we know the outcome of our real 
+simplification is better than or equal to a rough estimate,
+and therefore can be bounded by that estimate.
+This is a bit harder to establish compared with proving
+$\textit{flts}$ does not make a list larger (which can
+be proven using routine induction):
+\begin{center}
+	$\llbracket  \textit{rflts}\; rs \rrbracket_r \leq
+	\llbracket  \textit{rs} \rrbracket_r$
+\end{center}
+We cannot simply prove how each helper function
+reduces the size and then put them together:
+From
+\begin{center}
+$\llbracket  \textit{rflts}\; rs \rrbracket_r \leq
+	\llbracket \; \textit{rs} \rrbracket_r$
+\end{center}
+and
+\begin{center}
+     $\llbracket  \textit{rdistinct} \; rs \; \varnothing \leq
+     \llbracket rs \rrbracket_r$
+\end{center}
+one cannot imply
+\begin{center}
 	$\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r 
 	\leq 
 	\llbracket \rdistinct{rs}{\varnothing}  \rrbracket_r  $.
+\end{center}
+What we can imply is that 
+\begin{center}
+	$\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r 
+	\leq
+	\llbracket rs \rrbracket_r$
+\end{center}
+but this estimate is too rough and $\llbracket rs \rrbracket_r$	is unbounded.
+The way we 
+get through this is by first proving a more general lemma 
+(so that the inductive case goes through):
+\begin{lemma}\label{fltsSizeReductionAlts}
+	If we have three accumulator sets:
+	$noalts\_set$, $alts\_set$ and $corr\_set$,
+	satisfying:
+	\begin{itemize}
+		\item
+			$\forall r \in noalts\_set. \; \nexists xs.\; r = \sum  xs$
+		\item
+			$\forall r \in alts\_set. \; \exists xs. \; r = \sum xs
+			\; \textit{and} \; set \; xs \subseteq corr\_set$
+	\end{itemize}
+	then we have that
+	\begin{center}
+	\begin{tabular}{lcl}
+	$\llbracket  (\textit{rdistinct} \; (\textit{rflts} \; as) \;
+	(noalts\_set \cup corr\_set)) \rrbracket_r$ & $\leq$ &\\
+						    $\llbracket  (\textit{rdistinct} \; as \; (noalts\_set \cup alts\_set \cup
+	\{ \ZERO \} )) \rrbracket_r$ & & \\ 
+	\end{tabular}
+	\end{center}
+		holds.
 \end{lemma}
 \noindent
+We need to split the accumulator into two parts: the part
+which contains alternative regular expressions ($alts\_set$), and 
+the part without any of them($noalts\_set$).
+The set $corr\_set$ is the corresponding set
+of $alts\_set$ with all elements under the $\sum$ constructor
+spilled out.
+\begin{proof}
+	By induction on the list $as$. We make use of lemma \ref{rdistinctConcat}.
+\end{proof}
+By setting all three sets to the empty set, one gets the desired size estimate:
+\begin{corollary}\label{interactionFltsDB}
+	$\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r 
+	\leq 
+	\llbracket \rdistinct{rs}{\varnothing}  \rrbracket_r  $.
+\end{corollary}
+\begin{proof}
+	By using the lemma \ref{fltsSizeReductionAlts}.
+\end{proof}
+\noindent
 The intuition is that if we remove duplicates from the $\textit{LHS}$, at least the same amount of 
 duplicates will be removed from the list $\textit{rs}$ in the $\textit{RHS}$. 
 
@@ -1778,13 +1907,10 @@
 	By using \ref{interactionFltsDB}.
 \end{proof}
 \noindent
-which says that the size of regular expression
-is always smaller if we apply the full simplification
-rather than just one component ($\rdistinct{}{}$).
-
-
-Now we are ready to control the sizes of
-$r_1 \cdot r_2 \backslash s$, $r^* \backslash s$.
+This is a key lemma in establishing the bounds on all the 
+closed forms.
+With this we are now ready to control the sizes of
+$(r_1 \cdot r_2 )\backslash s$, $r^* \backslash s$.
 \begin{theorem}
 	For any regex $r$, $\exists N_r. \forall s. \; \rsize{\rderssimp{r}{s}} \leq N_r$
 \end{theorem}
@@ -1893,6 +2019,10 @@
 \end{proof}
 \noindent
 
+
+
+
+
 %-----------------------------------
 %	SECTION 2
 %-----------------------------------
@@ -2062,53 +2192,54 @@
 %----------------------------------------------------------------------------------------
 %	SECTION ALTS CLOSED FORM
 %----------------------------------------------------------------------------------------
-\section{A Closed Form for \textit{ALTS}}
-Now we prove that  $rsimp (rders\_simp (RALTS rs) s) = rsimp (RALTS (map (\lambda r. rders\_simp r s) rs))$.
-
-
-There are a few key steps, one of these steps is
-
-
-
-One might want to prove this by something a simple statement like: 
-
-For this to hold we want the $\textit{distinct}$ function to pick up
-the elements before and after derivatives correctly:
-$r \in rset \equiv (rder x r) \in (rder x rset)$.
-which essentially requires that the function $\backslash$ is an injective mapping.
-
-Unfortunately the function $\backslash c$ is not an injective mapping.
+%\section{A Closed Form for \textit{ALTS}}
+%Now we prove that  $rsimp (rders\_simp (RALTS rs) s) = rsimp (RALTS (map (\lambda r. rders\_simp r s) rs))$.
+%
+%
+%There are a few key steps, one of these steps is
+%
+%
+%
+%One might want to prove this by something a simple statement like: 
+%
+%For this to hold we want the $\textit{distinct}$ function to pick up
+%the elements before and after derivatives correctly:
+%$r \in rset \equiv (rder x r) \in (rder x rset)$.
+%which essentially requires that the function $\backslash$ is an injective mapping.
+%
+%Unfortunately the function $\backslash c$ is not an injective mapping.
+%
+%\subsection{function $\backslash c$ is not injective (1-to-1)}
+%\begin{center}
+%	The derivative $w.r.t$ character $c$ is not one-to-one.
+%	Formally,
+%	$\exists r_1 \;r_2. r_1 \neq r_2 \mathit{and} r_1 \backslash c = r_2 \backslash c$
+%\end{center}
+%This property is trivially true for the
+%character regex example:
+%\begin{center}
+%	$r_1 = e; \; r_2 = d;\; r_1 \backslash c = \ZERO = r_2 \backslash c$
+%\end{center}
+%But apart from the cases where the derivative
+%output is $\ZERO$, are there non-trivial results
+%of derivatives which contain strings?
+%The answer is yes.
+%For example,
+%\begin{center}
+%	Let $r_1 = a^*b\;\quad r_2 = (a\cdot a^*)\cdot b + b$.\\
+%	where $a$ is not nullable.\\
+%	$r_1 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$\\
+%	$r_2 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$
+%\end{center}
+%We start with two syntactically different regular expressions,
+%and end up with the same derivative result.
+%This is not surprising as we have such 
+%equality as below in the style of Arden's lemma:\\
+%\begin{center}
+%	$L(A^*B) = L(A\cdot A^* \cdot B + B)$
+%\end{center}
 
-\subsection{function $\backslash c$ is not injective (1-to-1)}
-\begin{center}
-	The derivative $w.r.t$ character $c$ is not one-to-one.
-	Formally,
-	$\exists r_1 \;r_2. r_1 \neq r_2 \mathit{and} r_1 \backslash c = r_2 \backslash c$
-\end{center}
-This property is trivially true for the
-character regex example:
-\begin{center}
-	$r_1 = e; \; r_2 = d;\; r_1 \backslash c = \ZERO = r_2 \backslash c$
-\end{center}
-But apart from the cases where the derivative
-output is $\ZERO$, are there non-trivial results
-of derivatives which contain strings?
-The answer is yes.
-For example,
-\begin{center}
-	Let $r_1 = a^*b\;\quad r_2 = (a\cdot a^*)\cdot b + b$.\\
-	where $a$ is not nullable.\\
-	$r_1 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$\\
-	$r_2 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$
-\end{center}
-We start with two syntactically different regular expressions,
-and end up with the same derivative result.
-This is not surprising as we have such 
-equality as below in the style of Arden's lemma:\\
-\begin{center}
-	$L(A^*B) = L(A\cdot A^* \cdot B + B)$
-\end{center}
-
+\section{Further Improvements to the Bound}
 There are two problems with this finiteness result, though.
 \begin{itemize}
 	\item