lexing: changeset 619:2072a8d54e3e

--- a/ChengsongTanPhdThesis/Chapters/Finite.tex	Thu Oct 13 23:55:25 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Finite.tex	Sun Nov 06 23:06:10 2022 +0000
@@ -7,53 +7,68 @@
 %of our bitcoded algorithm, that is a finite bound on the size of any 
 %regex's derivatives. 
 
-\begin{figure}
-\begin{center}
-	\begin{tabular}{ccc}
-		$\llbracket _{bs}\ONE \rrbracket$ & $\dn$ & $1$\\
-		$\llbracket \ZERO \rrbracket$ & $\dn$ & $1$ \\
-		$\llbracket _{bs} r_1 \cdot r_2 \rrbracket$ & $\dn$ & $\llbracket r_1 \rrbracket + \llbracket r_2 \rrbracket + 1$\\
-		$\llbracket _{bs}\mathbf{c} \rrbracket $ & $\dn$ & $1$\\
-		$\llbracket _{bs}\sum as \rrbracket $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket)\; as   + 1$\\
-		$\llbracket _{bs} a^* \rrbracket $ & $\dn$ & $\llbracket a \rrbracket + 1$.
-	\end{tabular}
-\end{center}
-\caption{The size function of bitcoded regular expressions}\label{brexpSize}
-\end{figure}
-In this chapter we give a guarantee in terms of size: 
+
+In this chapter we give a guarantee in terms of size of derivatives: 
 given an annotated regular expression $a$, for any string $s$
 our algorithm $\blexersimp$'s internal annotated regular expression 
 size  is finitely bounded
-by a constant $N_a$ that only depends on $a$:
+by a constant  that only depends on $a$.
+Formally we show that there exists an $N_a$ such that
 \begin{center}
 	$\llbracket \bderssimp{a}{s} \rrbracket \leq N_a$
 \end{center}
 \noindent
-where the size of an annotated regular expression is defined
-in terms of the number of nodes in its tree structure (see figure \ref{brexpSize}).
+where the size ($\llbracket \_ \rrbracket$) of 
+an annotated regular expression is defined
+in terms of the number of nodes in its 
+tree structure (its recursive definition given in the next page).
 We believe this size bound
 is important in the context of POSIX lexing, because 
 \begin{itemize}
 	\item
-		It is a stepping stone towards an ``absence of catastrophic-backtracking''
-		guarantee. 
-		If the internal data structures used by our algorithm cannot grow very large,
-		then our algorithm (which traverses those structures) cannot be too slow.
+		It is a stepping stone towards the goal 
+		of eliminating ``catastrophic backtracking''. 
+		If the internal data structures used by our algorithm
+		grows beyond a finite bound, then clearly 
+		the algorithm (which traverses these structures) will
+		be slow.
 		The next step would be to refine the bound $N_a$ so that it
 		is polynomial on $\llbracket a\rrbracket$.
 	\item
-		Having it formalised gives us a higher confidence that
+		Having the finite bound formalised 
+		gives us a higher confidence that
 		our simplification algorithm $\simp$ does not ``mis-behave''
 		like $\simpsulz$ does.
-		The bound is universal, which is an advantage over work which 
-		only gives empirical evidence on some test cases.
+		The bound is universal for a given regular expression, 
+		which is an advantage over work which 
+		only gives empirical evidence on some test cases (see Verbatim++\cite{Verbatimpp}).
 \end{itemize}
+In the next section we describe in more detail
+what the finite bound means in our algorithm
+and why the size of the internal data structures of
+a typical derivative-based lexer such as
+Sulzmann and Lu's need formal treatment.
+
 \section{Formalising About Size}
 \noindent
 In our lexer ($\blexersimp$),
 we take an annotated regular expression as input,
-and repeately take derivative of and simplify it:
-\begin{figure}[H]
+and repeately take derivative of and simplify it.
+\begin{figure}
+	\begin{center}
+		\begin{tabular}{lcl}
+			$\llbracket _{bs}\ONE \rrbracket$ & $\dn$ & $1$\\
+			$\llbracket \ZERO \rrbracket$ & $\dn$ & $1$ \\
+			$\llbracket _{bs} r_1 \cdot r_2 \rrbracket$ & $\dn$ & $\llbracket r_1 \rrbracket + \llbracket r_2 \rrbracket + 1$\\
+			$\llbracket _{bs}\mathbf{c} \rrbracket $ & $\dn$ & $1$\\
+			$\llbracket _{bs}\sum as \rrbracket $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket)\; as   + 1$\\
+			$\llbracket _{bs} a^* \rrbracket $ & $\dn$ & $\llbracket a \rrbracket + 1$.
+		\end{tabular}
+	\end{center}
+	\caption{The size function of bitcoded regular expressions}\label{brexpSize}
+\end{figure}
+
+\begin{figure}
 	\begin{tikzpicture}[scale=2,
 		every node/.style={minimum size=11mm},
 		->,>=stealth',shorten >=1pt,auto,thick
@@ -79,15 +94,16 @@
 	\end{tikzpicture}
 	\caption{Regular expression size change during our $\blexersimp$ algorithm}\label{simpShrinks}
 \end{figure}
+
 \noindent
 Each time
 a derivative is taken, the regular expression might grow.
 However, the simplification that is immediately afterwards will always shrink it so that 
-it stays small.
+it stays relatively small.
 This intuition is depicted by the relative size
 change between the black and blue nodes:
 After $\simp$ the node always shrinks.
-Our proof says that all the blue nodes
+Our proof states that all the blue nodes
 stay below a size bound $N_a$ determined by the input $a$.
 
 \noindent
@@ -126,7 +142,7 @@
 \noindent
 The picture means that on certain cases their lexer (where they use $\simpsulz$ 
 as the simplification function)
-will have an indefinite size explosion, causing the running time 
+will have a size explosion, causing the running time 
 of each derivative step to grow continuously (for example 
 in \ref{SulzmannLuLexerTime}).
 They tested out the run time of their
@@ -137,13 +153,13 @@
 
 Before delving in to the details of the formalisation,
 we are going to provide an overview of it.
-In the next subsection, we draw a picture of the bird's eye view
+In the next subsection, we give a high-level view
 of the proof.
 
 
 \subsection{Overview of the Proof}
-Here is a bird's eye view of the main components of the finiteness proof,
-which involves three steps:
+A high-level overview of the main components of the finiteness proof
+is as follows:
 \begin{figure}[H]
 	\begin{tikzpicture}[scale=1,font=\bf,
 		node/.style={
@@ -179,21 +195,23 @@
 	\item
 		We first introduce the operations such as 
 		derivatives, simplification, size calculation, etc.
-		associated with $\rrexp$s, which we have given
-		a very brief introduction to in chapter \ref{Bitcoded2}.
+		associated with $\rrexp$s, which we have introduced
+		in chapter \ref{Bitcoded2}.
 		The operations on $\rrexp$s are identical to those on
-		annotated regular expressions except that they are unaware
-		of bitcodes. This means that all proofs about size of $\rrexp$s will apply to
-		annotated regular expressions.
+		annotated regular expressions except that they dispense with
+		bitcodes. This means that all proofs about size of $\rrexp$s will apply to
+		annotated regular expressions, because the size of a regular
+		expression is independent of the bitcodes.
 	\item
 		We prove that $\rderssimp{r}{s} = \textit{ClosedForm}(r, s)$,
 		where $\textit{ClosedForm}(r, s)$ is entirely 
-		written in the derivatives of their children regular 
+		given as the derivatives of their children regular 
 		expressions.
 		We call the right-hand-side the \emph{Closed Form}
 		of the derivative $\rderssimp{r}{s}$.
 	\item
-		We estimate $\llbracket \textit{ClosedForm}(r, s) \rrbracket_r$.
+		Formally we give an estimate of 
+		$\llbracket \textit{ClosedForm}(r, s) \rrbracket_r$.
 		The key observation is that $\distinctBy$'s output is
 		a list with a constant length bound.
 \end{itemize}
@@ -202,14 +220,14 @@
 \section{The $\textit{Rrexp}$ Datatype}
 The first step is to define 
 $\textit{rrexp}$s.
-They are without bitcodes,
+They are annotated regular expressions without bitcodes,
 allowing a much simpler size bound proof.
-Of course, the bits which encode the lexing information 
-would grow linearly with respect 
-to the input, which should be taken into account when we wish to tackle the runtime comlexity.
-But for the sake of the structural size 
-we can safely ignore them.\\
-To recapitulate, the datatype 
+%Of course, the bits which encode the lexing information 
+%would grow linearly with respect 
+%to the input, which should be taken into account when we wish to tackle the runtime comlexity.
+%But for the sake of the structural size 
+%we can safely ignore them.\\
+The datatype 
 definition of the $\rrexp$, called
 \emph{r-regular expressions},
 was initially defined in \ref{rrexpDef}.
@@ -226,7 +244,7 @@
 written $\llbracket r\rrbracket_r$, 
 whose definition mirrors that of an annotated regular expression. 
 \begin{center}
-	\begin{tabular}{ccc}
+	\begin{tabular}{lcl}
 		$\llbracket _{bs}\ONE \rrbracket_r$ & $\dn$ & $1$\\
 		$\llbracket \ZERO \rrbracket_r$ & $\dn$ & $1$ \\
 		$\llbracket _{bs} r_1 \cdot r_2 \rrbracket_r$ & $\dn$ & $\llbracket r_1 \rrbracket_r + \llbracket r_2 \rrbracket_r + 1$\\
@@ -254,27 +272,31 @@
 \end{center}
 
 \subsection{Why a New Datatype?}
-The reason we take all the trouble 
-defining a new datatype is that $\erase$ makes things harder.
+The reason we
+define a new datatype is that 
+the $\erase$ function 
+does not preserve the structure of annotated
+regular expressions.
 We initially started by using 
 plain regular expressions and tried to prove
-the lemma \ref{rsizeAsize},
-however the $\erase$ function unavoidbly messes with the structure of the 
+lemma \ref{rsizeAsize},
+however the $\erase$ function messes with the structure of the 
 annotated regular expression.
 The $+$ constructor
-of basic regular expressions is binary whereas $\sum$ 
-takes a list, and one has to convert between them:
+of basic regular expressions is only binary, whereas $\sum$ 
+takes a list. Therefore we need to convert between
+annotated and normal regular expressions as follows:
 \begin{center}
-	\begin{tabular}{ccc}
+	\begin{tabular}{lcl}
 		$\erase \; _{bs}\sum [] $ & $\dn$ & $\ZERO$\\
 		$\erase \; _{bs}\sum [a]$ & $\dn$ & $a$\\
 		$\erase \; _{bs}\sum a :: as$ & $\dn$ & $a + (\erase \; _{[]} \sum as)\quad \text{if $as$ length over 1}$
 	\end{tabular}
 \end{center}
 \noindent
-An alternative regular expression with an empty argument list
+As can be seen alternative regular expression with an empty argument list
 will be turned into a $\ZERO$.
-The singleton alternative $\sum [r]$ would have $r$ during the
+The singleton alternative $\sum [r]$ becomes $r$ during the
 $\erase$ function.
 The  annotated regular expression $\sum[a, b, c]$ would turn into
 $(a+(b+c))$.
@@ -284,7 +306,7 @@
 For example, if we define the size of a basic plain regular expression 
 in the usual way,
 \begin{center}
-	\begin{tabular}{ccc}
+	\begin{tabular}{lcl}
 		$\llbracket \ONE \rrbracket_p$ & $\dn$ & $1$\\
 		$\llbracket \ZERO \rrbracket_p$ & $\dn$ & $1$ \\
 		$\llbracket r_1 \cdot r_2 \rrbracket_p$ & $\dn$ & $\llbracket r_1 \rrbracket_p + \llbracket r_2 \rrbracket_p + 1$\\
@@ -303,7 +325,7 @@
 only the bitcodes are thrown away.
 Everything about the structure remains intact.
 Therefore it does not change the size
-of an annotated regular expression:
+of an annotated regular expression and we have:
 \begin{lemma}\label{rsizeAsize}
 	$\rsize{\rerase a} = \asize a$
 \end{lemma}
@@ -317,17 +339,17 @@
 but we found our approach more straightforward.\\
 
 \subsection{Functions for R-regular Expressions}
-We shall define the r-regular expression version
-of $\blexer$ and $\blexersimp$ related functions.
+In this section we shall define the r-regular expression version
+of $\blexer$, and $\blexersimp$ related functions.
 We use $r$ as the prefix or subscript to differentiate
 with the bitcoded version.
-For example,$\backslash_r$, $\rdistincts$, and $\rsimp$
-as opposed to $\backslash$, $\distinctBy$, and $\bsimp$.
-As promised, they are much simpler than their bitcoded counterparts.
+%For example,$\backslash_r$, $\rdistincts$, and $\rsimp$
+%as opposed to $\backslash$, $\distinctBy$, and $\bsimp$.
+%As promised, they are much simpler than their bitcoded counterparts.
 %The operations on r-regular expressions are 
 %almost identical to those of the annotated regular expressions,
 %except that no bitcodes are used. For example,
-The derivative operation becomes simpler:\\
+The derivative operation for an r-regular expression is\\
 \begin{center}
 	\begin{tabular}{@{}lcl@{}}
 		$(\ZERO)\,\backslash_r c$ & $\dn$ & $\ZERO$\\  
@@ -347,10 +369,12 @@
 	\end{tabular}    
 \end{center}  
 \noindent
-Similarly, $\distinctBy$ does not need 
+where we omit the definition of $\textit{rnullable}$.
+
+The function $\distinctBy$ for r-regular expressions does not need 
 a function checking equivalence because
-there are no bit annotations causing superficial differences
-between syntactically equal terms.
+there are no bit annotations.
+Therefore we have
 \begin{center}
 	\begin{tabular}{lcl}
 		$\rdistinct{[]}{rset} $ & $\dn$ & $[]$\\
@@ -362,14 +386,14 @@
 \end{center}
 %TODO: definition of rsimp (maybe only the alternative clause)
 \noindent
-We would like to make clear
-a difference between our $\rdistincts$ and
-the Isabelle $\textit {distinct}$ predicate.
-In Isabelle $\textit{distinct}$ is a function that returns a boolean
-rather than a list.
-It tests if all the elements of a list are unique.\\
-With $\textit{rdistinct}$,
-and the flatten function for $\rrexp$s:
+%We would like to make clear
+%a difference between our $\rdistincts$ and
+%the Isabelle $\textit {distinct}$ predicate.
+%In Isabelle $\textit{distinct}$ is a function that returns a boolean
+%rather than a list.
+%It tests if all the elements of a list are unique.\\
+With $\textit{rdistinct}$ in place,
+the flatten function for $\rrexp$ is as follows:
  \begin{center}
   \begin{tabular}{@{}lcl@{}}
   $\textit{rflts} \; (\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $as \; @ \; \textit{rflts} \; as' $ \\
@@ -378,8 +402,8 @@
 \end{tabular}    
 \end{center}  
 \noindent
-one can chain together all the other modules
-such as $\rsimpalts$:
+The function 
+$\rsimpalts$ corresponds to $\textit{bsimp}_{ALTS}$:
 \begin{center}
   \begin{tabular}{@{}lcl@{}}
 	  $\rsimpalts \;\; nil$ & $\dn$ & $\RZERO$\\
@@ -388,7 +412,7 @@
 \end{tabular}    
 \end{center}  
 \noindent
-and $\rsimpseq$:
+Similarly, we have $\rsimpseq$ which corresponds to $\textit{bsimp}_{SEQ}$:
 \begin{center}
   \begin{tabular}{@{}lcl@{}}
 	  $\rsimpseq \;\; \RZERO \; \_ $ &   $=$ &   $\RZERO$\\
@@ -420,8 +444,7 @@
 \end{center}
 \noindent
 We do not define an r-regular expression version of $\blexersimp$,
-as our proof does not involve its use 
-(and there is no bitcode to decode into a lexical value). 
+as our proof does not depend on it.
 Now we are ready to introduce how r-regular expressions allow
 us to prove the size bound on bitcoded regular expressions.
 
@@ -431,7 +454,7 @@
 can be calculated via the size of r-regular expressions after the application
 of $\rsimp$ and $\backslash_{rsimps}$:
 \begin{lemma}\label{sizeRelations}
-	The following equalities hold:
+	The following two equalities hold:
 	\begin{itemize}
 		\item
 			$\asize{\bsimps \; a} = \rsize{\rsimp{ \rerase{a}}}$
@@ -456,13 +479,13 @@
 	implies
 	$\quad \llbracket a \backslash_{bsimps} s \rrbracket \leq N_r$.
 \end{center}
-From now on we 
-Unless stated otherwise in the rest of this 
-chapter all regular expressions without
-bitcodes are seen as r-regular expressions ($\rrexp$s).
-For the binary alternative r-regular expression $\RALTS{[r_1, r_2]}$,
-we use the notation $r_1 + r_2$
-for brevity.
+%From now on we 
+%Unless stated otherwise in the rest of this 
+%chapter all regular expressions without
+%bitcodes are seen as r-regular expressions ($\rrexp$s).
+%For the binary alternative r-regular expression $\RALTS{[r_1, r_2]}$,
+%we use the notation $r_1 + r_2$
+%for brevity.
 
 
 %-----------------------------------
@@ -521,15 +544,16 @@
 $(r_1\cdot r_2) \backslash_{rsimps} s$ looks like,
 and therefore $N_{r_1}$ and $N_{r_2}$ in the
 inductive hypotheses cannot be directly used.
-We have already seen that $(r_1 \cdot r_2)\backslash s$ 
-and $(r^*)\backslash s$ can grow in a wild way.
+%We have already seen that $(r_1 \cdot r_2)\backslash s$ 
+%and $(r^*)\backslash s$ can grow in a wild way.
 
-The point is that they will be equivalent to a list of
+The point however, is that they will be equivalent to a list of
 terms $\sum rs$, where each term in $rs$ will
 be made of $r_1 \backslash s' $, $r_2\backslash s'$,
-and $r \backslash s'$ with $s' \in \textit{SubString} \; s$.
+and $r \backslash s'$ with $s' \in \textit{SubString} \; s$ (which stands
+for the set of substrings of $s$).
 The list $\sum rs$ will then be de-duplicated by $\textit{rdistinct}$
-in the simplification which saves $rs$ from growing indefinitely.
+in the simplification, which prevents the $rs$ from growing indefinitely.
 
 Based on this idea, we develop a proof in two steps.
 First, we show the equality (where
@@ -547,23 +571,45 @@
 \emph{Closed Form} of $(r_1 \cdot r_2)\backslash_{rsimps} s$.
 Second, we will bound the closed form of r-regular expressions
 using some estimation techniques
-and then piece it together
-with lemma \ref{sizeRelations} to show the bitcoded regular expressions
+and then apply
+lemma \ref{sizeRelations} to show that the bitcoded regular expressions
 in our $\blexersimp$ are finitely bounded.
 
-We will flesh out the first step of the proof we 
-sketched just now in the next section.
+We will describe in detail the first step of the proof
+in the next section.
 
 \section{Closed Forms}
 In this section we introduce in detail
-how the closed forms are obtained for regular expressions'
-derivatives.
+how we express the string derivatives
+of regular expressions (i.e. $r \backslash_r s$ where $s$ is a string
+rather than a single character) in a different way than 
+our previous definition.
+In previous chapters, the derivative of a 
+regular expression $r$ w.r.t a string $s$
+was recursively defined on the string:
+\begin{center}
+	$r \backslash_s (c::s) \dn (r \backslash c) \backslash_s s$
+\end{center}
+The problem is that 
+this definition does not provide much information
+on what $r \backslash_s s$ looks like.
+If we are interested in the size of a derivative like 
+$(r_1 \cdot r_2)\backslash s$,
+we have to somehow get a more concrete form to begin.
+We call such more concrete representations the ``closed forms'' of
+string derivatives as opposed to their original definitions.
+The terminology ``closed form'' is borrowed from mathematics,
+which usually describe expressions that are solely comprised of
+well-known and easy-to-compute operations such as 
+additions, multiplications, exponential functions.
+
 We start by proving some basic identities
 involving the simplification functions for r-regular expressions.
 After that we introduce the rewrite relations
 $\rightsquigarrow_h$, $\rightsquigarrow^*_{scf}$
 $\rightsquigarrow_f$ and $\rightsquigarrow_g$.
-These relations involves similar techniques in chapter \ref{Bitcoded2}.
+These relations involves similar techniques as in chapter \ref{Bitcoded2}
+for annotated regular expressions.
 Finally, we use these identities to establish the
 closed forms of the alternative regular expression,
 the sequence regular expression, and the star regular expression.
@@ -573,11 +619,9 @@
 
 \subsection{Some Basic Identities}
 
-We now introduce lemmas 
-that are repeatedly used in later proofs.
-Note that for the $\textit{rdistinct}$ function there
-will be a lot of conversion from lists to sets.
-We use $set$ to refere to the 
+In what follows we will often convert between lists
+and sets.
+We use Isabelle's $set$ to refere to the 
 function that converts a list $rs$ to the set
 containing all the elements in $rs$.
 \subsubsection{$\textit{rdistinct}$'s Does the Job of De-duplication}
@@ -589,8 +633,7 @@
 	%The function $\textit{rdistinct}$ satisfies the following
 	%properties:
 	Assume we have the predicate $\textit{isDistinct}$\footnote{We omit its
-	recursive definition here, its Isabelle counterpart would be $\textit{distinct}$.} 
-	readily defined
+	recursive definition here. Its Isabelle counterpart would be $\textit{distinct}$.} 
 	for testing
 	whether a list's elements are unique. Then the following
 	properties about $\textit{rdistinct}$ hold:
@@ -675,12 +718,13 @@
 \begin{proof}
 	By induction on $rs$. All other variables are allowed to be arbitrary.
 	The second part of the lemma requires the first.
-	Note that for each part, the two sub-propositions need to be proven concurrently,
+	Note that for each part, the two sub-propositions need to be proven 
+	at the same time,
 	so that the induction goes through.
 \end{proof}
 \noindent
 This allows us to prove a few more equivalence relations involving 
-$\textit{rdistinct}$ (it will be useful later):
+$\textit{rdistinct}$ (they will be useful later):
 \begin{lemma}\label{rdistinctConcatGeneral}
 	\mbox{}
 	\begin{itemize}
@@ -723,7 +767,7 @@
 \end{proof}
 \noindent
 $\textit{rdistinct}$ needs to be applied only once, and 
-applying it multiple times does not cause any difference:
+applying it multiple times does not make any difference:
 \begin{corollary}\label{distinctOnceEnough}
 	$\textit{rdistinct} \; (rs @ rsa) {} = \textit{rdistinct} \; (rdistinct \; 
 	rs \{ \} @ (\textit{rdistinct} \; rs_a \; (set \; rs)))$
@@ -748,7 +792,7 @@
 %functions on two lists separately, and then concatenating them together.
 $\textit{Rflts}$ is composable in terms of concatenation:
 \begin{lemma}\label{rfltsProps}
-	The function $\rflts$ has the below properties:\\
+	The function $\rflts$ has the properties below:\\
 	\begin{itemize}
 		\item
 			$\rflts \; (rs_1 @ rs_2) = \rflts \; rs_1 @ \rflts \; rs_2$
@@ -779,15 +823,18 @@
 \noindent
 Now we introduce the property that the operations 
 derivative and $\rsimpalts$
-commute, this will be used later in deriving the closed form for
+commute, this will be used later on, when deriving the closed form for
 the alternative regular expression:
 \begin{lemma}\label{rderRsimpAltsCommute}
 	$\rder{x}{(\rsimpalts \; rs)} = \rsimpalts \; (\map \; (\rder{x}{\_}) \; rs)$
 \end{lemma}
+\begin{proof}
+	By induction on $rs$.
+\end{proof}
 \noindent
 
 \subsubsection{The $RL$ Function: Language Interpretation of $\textit{Rrexp}$s}
-Much like the definition of $L$ on plain regular expressions, one could also 
+Much like the definition of $L$ on plain regular expressions, one can also 
 define the language interpretation of $\rrexp$s.
 \begin{center}
 	\begin{tabular}{lcl}
@@ -820,10 +867,12 @@
 \subsubsection{Simplified $\textit{Rrexp}$s are Good}
 We formalise the notion of ``good" regular expressions,
 which means regular expressions that
-are not fully simplified. For alternative regular expressions that means they
-do not contain any nested alternatives like 
-\[ r_1 + (r_2 + r_3) \], un-removed $\RZERO$s like \[\RZERO + r\]
-or duplicate elements in a children regular expression list like \[ \sum [r, r, \ldots]\]:
+are fully simplified in terms of our $\textit{rsimp}$ function. 
+For alternative regular expressions that means they
+do not contain any nested alternatives, un-eliminated $\RZERO$s
+or duplicate elements (for example, 
+$r_1 + (r_2 + r_3)$, $\RZERO + r$ and $ \sum [r, r, \ldots]$).
+The clauses for $\good$ are:
 \begin{center}
 	\begin{tabular}{@{}lcl@{}}
 		$\good\; \RZERO$ & $\dn$ & $\textit{false}$\\
@@ -833,7 +882,7 @@
 		$\good\; \RALTS{[r]}$ & $\dn$ & $\bfalse$\\
 		$\good\; \RALTS{r_1 :: r_2 :: rs}$ & $\dn$ & 
 		$\textit{isDistinct} \; (r_1 :: r_2 :: rs) \;$\\
-						   & & $\textit{and}\; (\forall r' \in (r_1 :: r_2 :: rs).\; \good \; r'\; \,  \textit{and}\; \, \textit{nonAlt}\; r')$\\
+						   & & $\land \; (\forall r' \in (r_1 :: r_2 :: rs).\; \good \; r'\; \,  \land \; \, \textit{nonAlt}\; r')$\\
 		$\good \; \RSEQ{\RZERO}{r}$ & $\dn$ & $\bfalse$\\
 		$\good \; \RSEQ{\RONE}{r}$ & $\dn$ & $\bfalse$\\
 		$\good \; \RSEQ{r}{\RZERO}$ & $\dn$ & $\bfalse$\\
@@ -842,7 +891,8 @@
 	\end{tabular}
 \end{center}
 \noindent
-The predicate $\textit{nonAlt}$ evaluates to true when the regular expression is not an
+We omit the recursive definition of the predicate $\textit{nonAlt}$,
+which evaluates to true when the regular expression is not an
 alternative, and false otherwise.
 The $\good$ property is preserved under $\rsimp_{ALTS}$, provided that
 its non-empty argument list of expressions are all good themsleves, and $\textit{nonAlt}$, 
@@ -879,10 +929,13 @@
 which enables $\rsimp$ to be non-nested:
 
 \begin{lemma}\label{nonnestedRsimp}
-	$\nonnested \; (\rsimp{r})$
+	It is always the case that
+	\begin{center}
+		$\nonnested \; (\rsimp{r})$
+	\end{center}
 \end{lemma}
 \begin{proof}
-	By an induction on $r$.
+	By induction on $r$.
 \end{proof}
 \noindent
 With this we could prove that a regular expressions
@@ -896,13 +949,13 @@
 	By \ref{nonnestedRsimp}.
 \end{proof}
 \noindent
-The other thing we know is that once $\rsimp{}$ had finished
+The other fact we know is that once $\rsimp{}$ has finished
 processing an alternative regular expression, it will not
-contain any $\RZERO$s, this is because all the recursive 
+contain any $\RZERO$s. This is because all the recursive 
 calls to the simplification on the children regular expressions
-make the children good, and $\rflts$ will not take out
+make the children good, and $\rflts$ will not delete
 any $\RZERO$s out of a good regular expression list,
-and $\rdistinct{}$ will not mess with the result.
+and $\rdistinct{}$ will not ``mess'' with the result.
 \begin{lemma}\label{flts3Obv}
 	The following are true:
 	\begin{itemize}
@@ -931,7 +984,7 @@
 \end{lemma}
 \begin{proof}
 	By an induction on $r$. The inductive measure is the size $\llbracket \rrbracket_r$.
-	Lemma \ref{rsimpSize} says that 
+	Lemma \ref{rsimpMono} says that 
 	$\llbracket \rsimp{r}\rrbracket_r$ is smaller than or equal to
 	$\llbracket r \rrbracket_r$.
 	Therefore, in the $r_1 \cdot r_2$ and $\sum rs$ case,
@@ -953,7 +1006,7 @@
 \end{proof}
 \noindent
 Now we are ready to prove that good regular expressions are invariant
-of $\rsimp{}$ application:
+with respect to $\rsimp{}$:
 \begin{lemma}\label{test}
 	If $\good \;r$ then $\rsimp{r} = r$.
 \end{lemma}
@@ -964,16 +1017,19 @@
 	case where 2 or more elements are present in the list.
 \end{proof}
 \noindent
-Given below is a property involving $\rflts$, $\rdistinct{}{}$, $\rsimp{}$ and $\rsimp_{ALTS}$,
-which requires $\ref{good1}$ to go through smoothly.
-It says that an application of $\rsimp_{ALTS}$ can be "absorbed",
-if it its output is concatenated with a list and then applied to $\rflts$.
+Given below is a property involving $\rflts$, $\textit{rdistinct}$, 
+$\rsimp{}$ and $\rsimp_{ALTS}$,
+which requires $\ref{good1}$ to go through smoothly:
 \begin{lemma}\label{flattenRsimpalts}
+An application of $\rsimp_{ALTS}$ can be ``absorbed'',
+if its output is concatenated with a list and then applied to $\rflts$.
+\begin{center}
 	$\rflts \; ( (\rsimp_{ALTS} \; 
 	(\rdistinct{(\rflts \; (\map \; \rsimp{}\; rs))}{\varnothing})) :: 
 	\map \; \rsimp{} \; rs' ) = 
 	\rflts \; ( (\rdistinct{(\rflts \; (\map \; \rsimp{}\; rs))}{\varnothing}) @ (
 	\map \; \rsimp{rs'}))$
+\end{center}
 
 
 \end{lemma}
@@ -993,7 +1049,7 @@
 forms so that key steps allowing further rewriting to closed forms
 are possible.
 \begin{lemma}\label{rsimpIdem}
-	$\rsimp{r} = \rsimp{\rsimp{r}}$
+	$\rsimp{r} = \rsimp{(\rsimp{r})}$
 \end{lemma}
 
 \begin{proof}
@@ -1005,7 +1061,7 @@
 our definition of $\blexersimp$.
 
 
-On the other hand, we could repeat the same $\rsimp{}$ applications
+On the other hand, we can repeat the same $\rsimp{}$ applications
 on regular expressions as many times as we want, if we have at least
 one simplification applied to it, and apply it wherever we would like to:
 \begin{corollary}\label{headOneMoreSimp}
@@ -1019,8 +1075,8 @@
 	\end{itemize}
 \end{corollary}
 \noindent
-This will be useful in later closed form proof's rewriting steps.
-Similarly, we point out the following useful facts below:
+This will be useful in the later closed form proof's rewriting steps.
+Similarly, we state the following useful facts below:
 \begin{lemma}
 	The following equalities hold if $r = \rsimp{r'}$ for some $r'$:
 	\begin{itemize}
@@ -1041,10 +1097,18 @@
 we can start proving some key equalities leading to the 
 closed forms.
 Now presented are a few equivalent terms under $\rsimp{}$.
-We use $r_1 \sequal r_2 $ here to denote $\rsimp{r_1} = \rsimp{r_2}$.
+To make the notation more concise
+We use $r_1 \sequal r_2 $ to denote that $\rsimp{r_1} = \rsimp{r_2}$:
+\begin{center}
+\begin{tabular}{lcl}
+	$a \sequal b$ & $ \dn$ & $ \textit{rsimp} \; a = \textit{rsimp} \; b$
+\end{tabular}
+\end{center}
+\noindent
+%\vspace{0em}
 \begin{lemma}
+	The following equivalence hold:
 	\begin{itemize}
-		The following equivalence hold:
 	\item
 		$\rsimpalts \; (\RZERO :: rs) \sequal \rsimpalts\; rs$
 	\item
@@ -1063,7 +1127,7 @@
 \noindent
 The above allows us to prove
 two similar equalities (which are a bit more involved).
-It says that we could flatten out the elements
+It says that we could flatten the elements
 before simplification and still get the same result.
 \begin{lemma}\label{simpFlatten3}
 	One can flatten the inside $\sum$ of a $\sum$ if it is being 
@@ -1099,7 +1163,7 @@
 \noindent
 
 
-We need more equalities like the above to enable a closed form,
+We need more equalities like the above to enable a closed form lemma,
 for which we need to introduce a few rewrite relations
 to help
 us obtain them.
@@ -1211,11 +1275,11 @@
 operations}\label{gRewrite}
 \end{figure}
 \noindent
-We defined
+We define
 two separate list rewriting relations $\frewrite$ and $\grewrite$.
 The rewriting steps that take place during
 flattening is characterised by $\frewrite$.
-$\grewrite$ characterises both flattening and de-duplicating.
+The rewrite relation $\grewrite$ characterises both flattening and de-duplicating.
 Sometimes $\grewrites$ is slightly too powerful
 so we would rather use $\frewrites$ to prove
 %because we only
@@ -1236,19 +1300,24 @@
 \end{lemma}
 \noindent
 and then the equivalence between two terms
-that can reduce in many steps to each other.
+that can reduce in many steps to each other:
 \begin{lemma}\label{frewritesSimpeq}
 	If $rs_1 \frewrites rs_2 $, then $\sum (\rDistinct \; rs_1 \; \varnothing) \sequal 
 	\sum (\rDistinct \;  rs_2 \;  \varnothing)$.
 \end{lemma}
 \noindent
+These two lemmas can both be proven using a straightforward induction (and
+the proofs for them are therefore omitted).
 
+Now the above equalities can be derived like a breeze:
 \begin{corollary}
 	$\sum (\rDistinct \;\; (\map \; (\_ \backslash x) \; (\rflts \; rs)) \;\; \varnothing) \sequal
 	\sum (\rDistinct \;\;  (\rflts \; (\map \; (\_ \backslash x) \; rs)) \;\; \varnothing)
 	$
 \end{corollary}
-
+\begin{proof}
+	By lemmas \ref{earlyLaterDerFrewrites} and \ref{frewritesSimpeq}.
+\end{proof}
 But this trick will not work for $\grewrites$.
 For example, a rewriting step in proving
 closed forms is:
@@ -1258,7 +1327,7 @@
 	$\rsimp{(\rsimpalts \; (\rdistinct{(\map \; (\_ \backslash x) \; (\rflts \; (\map \; (\rsimp{} \; \circ \; (\lambda r. \rderssimp{r}{xs})))) ) }{\varnothing}))} $
 	\noindent
 \end{center}
-For this one would hope to have a rewriting relation between the two lists involved,
+For this, one would hope to have a rewriting relation between the two lists involved,
 similar to \ref{earlyLaterDerFrewrites}. However, it turns out that 
 \begin{center}
 	$\map \; (\_ \backslash x) \; (\rDistinct \; rs \; rset) \grewrites \rDistinct \; (\map \;
@@ -1267,7 +1336,7 @@
 \noindent
 does $\mathbf{not}$ hold in general.
 For this rewriting step we will introduce some slightly more cumbersome
-proof technique in later sections.
+proof technique later.
 The point is that $\frewrite$
 allows us to prove equivalence in a straightforward way that is 
 not possible for $\grewrite$. 
@@ -1275,12 +1344,12 @@
 
 \subsubsection{Terms That Can Be Rewritten Using $\hrewrites$, $\grewrites$, and $\frewrites$}
 In this part, we present lemmas stating
-pairs of r-regular expressions or r-regular expression lists
+pairs of r-regular expressions and r-regular expression lists
 where one could rewrite from one in many steps to the other.
 Most of the proofs to these lemmas are straightforward, using
-an induction on the inductive cases of the corresponding rewriting relations.
+an induction on the corresponding rewriting relations.
 These proofs will therefore be omitted when this is the case.
-We present in the below lemma a few pairs of terms that are rewritable via 
+We present in the following lemma a few pairs of terms that are rewritable via 
 $\grewrites$:
 \begin{lemma}\label{gstarRdistinctGeneral}
 	\mbox{}
@@ -1302,6 +1371,7 @@
 If a pair of terms $rs_1, rs_2$ are rewritable via $\grewrites$ to each other,
 then they are equivalent under $\rsimp{}$:
 \begin{lemma}\label{grewritesSimpalts}
+	\mbox{}
 	If $rs_1 \grewrites rs_2$, then
 	we have the following equivalence:
 	\begin{itemize}
@@ -1330,7 +1400,7 @@
 	\end{itemize}
 \end{lemma}
 \noindent
-Here comes the meat of the proof, 
+Now comes the core of the proof, 
 which says that once two lists are rewritable to each other,
 then they are equivalent under $\rsimp{}$:
 \begin{lemma}
@@ -1338,15 +1408,17 @@
 \end{lemma}
 
 \noindent
-And similar to \ref{Bitcoded2} one can preserve rewritability after taking derivative
-of two regular expressions on both sides:
+Similar to what we did in \ref{Bitcoded2}, 
+we prove that if one can rewrite from one r-regular expression ($r$)
+to the other ($r'$), after taking derivatives one could still rewrite
+the first ($r\backslash c$) to the other ($r'\backslash c$).
 \begin{lemma}\label{interleave}
 	If $r \hrewrites r' $ then $\rder{c}{r} \hrewrites \rder{c}{r'}$
 \end{lemma}
 \noindent
-This allows proving more $\mathbf{rsimp}$-equivalent terms, involving $\backslash_r$ now.
+This allows us proving more $\mathbf{rsimp}$-equivalent terms, involving $\backslash_r$.
 \begin{lemma}\label{insideSimpRemoval}
-	$\rsimp{\rder{c}{\rsimp{r}}} = \rsimp{\rder{c}{r}}  $
+	$\rsimp{(\rder{c}{(\rsimp{r})})} = \rsimp{(\rder{c}{r})}  $
 \end{lemma}
 \noindent
 \begin{proof}
@@ -1373,15 +1445,15 @@
 \noindent
 
 \subsection{Closed Forms for $\sum rs$, $r_1\cdot r_2$ and $r^*$}
-\subsubsection{Closed Form for Alternative Regular Expression}
-Lemma \ref{Simpders} leads to the first closed form--
-\begin{lemma}\label{altsClosedForm}
+Lemma \ref{Simpders} leads to our first closed form,
+which is for the alternative regular expression:
+\begin{theorem}\label{altsClosedForm}
+	\mbox{}
 	\begin{center}
 		$\rderssimp{(\sum rs)}{s} \sequal
 		\sum \; (\map \; (\rderssimp{\_}{s}) \; rs)$
 	\end{center}
-\end{lemma}
-
+\end{theorem}
 \noindent
 \begin{proof}
 	By a reverse induction on the string $s$.
@@ -1402,7 +1474,7 @@
 \end{proof}
 \noindent
 This closed form has a variant which can be more convenient in later proofs:
-\begin{corollary}{altsClosedForm1}
+\begin{corollary}\label{altsClosedForm1}
 	If $s \neq []$ then 
 	$\rderssimp \; (\sum \; rs) \; s = 
 	\rsimp{(\sum \; (\map \; \rderssimp{\_}{s} \; rs))}$.
@@ -1414,15 +1486,23 @@
 
 
 \subsubsection{Closed Form for Sequence Regular Expressions}
-First let's look at a series of derivatives steps on a sequence 
-regular expression, assuming that each time the first
-component of the sequence is always nullable):
+For the sequence regular expression,
+let's first look at a series of derivative steps on it 
+(assuming that each time when a derivative is taken,
+the head of the sequence is always nullable):
 \begin{center}
-
-	$r_1 \cdot r_2 \quad \longrightarrow_{\backslash c}  \quad   r_1  \backslash c \cdot r_2 + r_2 \backslash c \quad \longrightarrow_{\backslash c'} \quad (r_1 \backslash cc' \cdot r_2 + r_2 \backslash c') + r_2 \backslash cc' \longrightarrow_{\backslash c''} \quad$\\
-	$((r_1 \backslash cc'c'' \cdot r_2 + r_2 \backslash c'') + r_2 \backslash c'c'') + r_2 \backslash cc'c''   \longrightarrow_{\backslash c''} \quad
-	\ldots$
-
+	\begin{tabular}{llll}
+		$r_1 \cdot r_2$ &  
+		$\longrightarrow_{\backslash c}$ &  
+		$r_1\backslash c \cdot r_2 + r_2 \backslash c$ &
+		$ \longrightarrow_{\backslash c'} $ \\ 
+		\\
+		$(r_1 \backslash cc' \cdot r_2 + r_2 \backslash c') + r_2 \backslash cc'$ & 
+		$\longrightarrow_{\backslash c''} $ &
+		$((r_1 \backslash cc'c'' \cdot r_2 + r_2 \backslash c'') + r_2 \backslash c'c'') 
+		+ r_2 \backslash cc'c''$ & 
+		$   \longrightarrow_{\backslash c''} \quad \ldots$\\
+	\end{tabular}
 \end{center}
 Roughly speaking $r_1 \cdot r_2 \backslash s$ can be expresssed as 
 a giant alternative taking a list of terms 
@@ -1430,32 +1510,50 @@
 where the head of the list is always the term
 representing a match involving only $r_1$, and the tail of the list consisting of
 terms of the shape $r_2 \backslash_r s''$, $s''$ being a suffix of $s$.
-This intuition is also echoed by IndianPaper, where they gave
+This intuition is also echoed by Murugesan and Sundaram \cite{Murugesan2014}, 
+where they gave
 a pencil-and-paper derivation of $(r_1 \cdot r_2)\backslash s$:
 \begin{center}
-	\begin{tabular}{c}
-		$(r_1 \cdot r_2) \backslash_r (c_1 :: c_2 :: \ldots c_n) \myequiv$\\ 
-		\rule{0pt}{3ex} $((r_1 \backslash_r c_1) \cdot r_2 + (\delta\; (\rnullable \; r_1) \; r_2 \backslash_r c_1)) \backslash_r (c_2 :: \ldots c_n) 
-		\myequiv$\\
-		\rule{0pt}{3ex} $((r_1 \backslash_r c_1c_2 \cdot r_2 + (\delta \; (\rnullable \; r_1) \; r_2 \backslash_r c_1c_2))
-		+ (\delta \ (\rnullable \; r_1 \backslash_r c)\; r_2 \backslash_r c_2)) \backslash_r (c_3 \ldots c_n)
-		$
+	\begin{tabular}{lc}
+		$L \; [ (r_1 \cdot r_2) \backslash_r (c_1 :: c_2 :: \ldots c_n) ]$ & $ =$\\
+		\\
+		\rule{0pt}{3ex} $L \; [ ((r_1 \backslash_r c_1) \cdot r_2 + 
+		(\delta\; (\nullable \; r_1) \; (r_2 \backslash_r c_1) )) \backslash_r 
+		(c_2 :: \ldots c_n) ]$ &
+		$=$\\
+		\\
+		\rule{0pt}{3ex} $L \; [ ((r_1 \backslash_r c_1c_2 \cdot r_2 + 
+		(\delta \; (\nullable \; r_1) \; (r_2 \backslash_r c_1c_2)))
+		$ & \\
+		\\
+		$\quad + (\delta \ (\nullable \; r_1 \backslash_r c)\; (r_2 \backslash_r c_2) )) 
+		\backslash_r (c_3 \ldots c_n) ]$ & $\ldots$ \\
 	\end{tabular}
 \end{center}
 \noindent
-The equality in above should be interpretated
-as language equivalence. 
-The $\delta$ function works similarly to that of
-a Kronecker delta function:
-\[ \delta \; b\; r\]
-will produce $r$
-if $b$ evaluates to true, 
-and $\RZERO$ otherwise.
-Note that their formulation  
-\[
-	((r_1 \backslash_r \, c_1c_2 \cdot r_2 + (\delta \; (\rnullable) \; r_1, r_2 \backslash_r c_1c_2)
-	+ (\delta \; (\rnullable \; r_1 \backslash_r c)\; r_2 \backslash_r c_2)
-\]
+The $\delta$ function 
+returns $r$ when the boolean condition
+$b$ evaluates to true and
+$\ZERO$ otherwise:
+\begin{center}
+	\begin{tabular}{lcl}
+		$\delta \; b\; r$ & $\dn$ & $r \quad \textit{if} \; b \; is \;\textit{true}$\\
+				  & $\dn$ & $\ZERO \quad otherwise$
+	\end{tabular}
+\end{center}
+\noindent
+Note that the term
+\begin{center}
+	\begin{tabular}{lc}
+		\rule{0pt}{3ex} $((r_1 \backslash_r c_1c_2 \cdot r_2 + 
+		(\delta \; (\nullable \; r_1) \; (r_2 \backslash_r c_1c_2)))
+		$ & \\
+		\\
+		$\quad + (\delta \ (\nullable \; r_1 \backslash_r c)\; (r_2 \backslash_r c_2) )) 
+		\backslash_r (c_3 \ldots c_n)$ &\\
+	\end{tabular}
+\end{center}
+\noindent
 does not faithfully
 represent what the intermediate derivatives would actually look like
 when one or more intermediate results $r_1 \backslash s' \cdot r_2$ are not 
@@ -1463,59 +1561,88 @@
 For example, when $r_1$ and $r_1 \backslash_r c_1$ are not nullable,
 the regular expression would not look like 
 \[
-	(r_1 \backslash_r c_1c_2 + \RZERO ) + \RZERO,
+	r_1 \backslash_r c_1c_2
 \]
-but actually $r_1 \backslash_r c_1c_2$, the redundant $\RZERO$s will not be created in the
+instead of
+\[
+	(r_1 \backslash_r c_1c_2 + \ZERO ) + \ZERO.
+\]
+The redundant $\ZERO$s will not be created in the
 first place.
-In a closed-form one would want to take into account this 
-and generate the list of
-regular expressions $r_2 \backslash_r s''$ with
-string pairs $(s', s'')$ where $s'@s'' = s$ and
-$r_1 \backslash s'$ nullable.
-We denote the list consisting of such 
-strings $s''$ as $\vsuf{s}{r_1}$.
-
-The function $\vsuf{\_}{\_}$ is defined recursively on the structure of the string:
+In a closed-form one needs to take into account this (because
+closed forms require exact equality rather than language equivalence)
+and only generate the 
+$r_2 \backslash_r s''$ terms satisfying the property
+\begin{center}
+$\exists  s'.  such \; that \; s'@s'' = s \;\; \land \;\;
+r_1 \backslash s' \; is \; nullable$.
+\end{center}
+Given the arguments $s$ and $r_1$, we denote the list of strings
+$s''$ satisfying the above property as $\vsuf{s}{r_1}$.
+The function $\vsuf{\_}{\_}$ is defined recursively on the structure of the string\footnote{
+	Perhaps a better name of it would be ``NullablePrefixSuffix''
+	to differentiate with the list of \emph{all} prefixes of $s$, but
+	that is a bit too long for a function name and we are yet to find
+a more concise and easy-to-understand name.}
 \begin{center}
 	\begin{tabular}{lcl}
 		$\vsuf{[]}{\_} $ & $=$ &  $[]$\\
-		$\vsuf{c::cs}{r_1}$ & $ =$ & $ \textit{if} (\rnullable{r_1}) \textit{then} \; (\vsuf{cs}{(\rder{c}{r_1})}) @ [c :: cs]$\\
+		$\vsuf{c::cs}{r_1}$ & $ =$ & $ \textit{if} \; (\rnullable{r_1}) \; \textit{then} \; (\vsuf{cs}{(\rder{c}{r_1})}) @ [c :: cs]$\\
 				    && $\textit{else} \; (\vsuf{cs}{(\rder{c}{r_1}) })  $
 	\end{tabular}
 \end{center}
 \noindent
-The list is sorted in the order $r_2\backslash s''$ 
-appears in $(r_1\cdot r_2)\backslash s$.
+The list starts with shorter suffixes
+and ends with longer ones (in other words, the string elements $s''$
+in the list $\vsuf{s}{r_1}$ are sorted
+in the same order as that of the terms $r_2\backslash s''$ 
+appearing in $(r_1\cdot r_2)\backslash s$).
 In essence, $\vsuf{\_}{\_}$ is doing a 
 "virtual derivative" of $r_1 \cdot r_2$, but instead of producing 
 the entire result $(r_1 \cdot r_2) \backslash s$, 
-it only stores all the strings $s''$ such that $r_2 \backslash s''$
-are occurring terms in $(r_1\cdot r_2)\backslash s$.
+it only stores strings,
+with each string $s''$ representing a term such that $r_2 \backslash s''$
+is occurring in $(r_1\cdot r_2)\backslash s$.
 
-To make the closed form representation 
-more straightforward,
-the flattetning function $\sflat{\_}$ is used to enable the transformation from 
+With $\textit{Suffix}$ we are ready to express the
+sequence regular expression's closed form,
+but before doing so 
+more devices are needed.
+The first thing is the flattening function $\sflat{\_}$,
+which takes an alternative regular expression and produces a flattened version
+of that alternative regular expression.
+It is needed to convert
 a left-associative nested sequence of alternatives into 
 a flattened list:
 \[
-	\sum [r_1, r_2, r_3, \ldots] \stackrel{\sflat{\_}}{\rightarrow} 
-	(\ldots ((r_1 + r_2) + r_3) + \ldots)
+	\sum(\ldots ((r_1 + r_2) + r_3) + \ldots)
+	\stackrel{\sflat{\_}}{\rightarrow} 
+	\sum[r_1, r_2, r_3, \ldots]
 \]
 \noindent
-The definitions $\sflat{\_}$, $\sflataux{\_}$ are given below.
+The definitions of $\sflat{\_}$ and helper functions
+$\sflataux{\_}$ and $\llparenthesis \_ \rrparenthesis''$ are given below.
 \begin{center}  
-	\begin{tabular}{ccc}
-		$\sflataux{\AALTS{ }{r :: rs}}$ & $=$ & $\sflataux{r} @ rs$\\
-		$\sflataux{\AALTS{ }{[]}}$ & $ = $ & $ []$\\
-		$\sflataux r$ & $=$ & $ [r]$
+	\begin{tabular}{lcl}
+		$\sflataux{\sum r :: rs}$ & $\dn$ & $\sflataux{r} @ rs$\\
+		$\sflataux{\sum []}$ & $ \dn $ & $ []$\\
+		$\sflataux r$ & $\dn$ & $ [r]$
 	\end{tabular}
 \end{center}
 
 \begin{center} 
-	\begin{tabular}{ccc}
-		$\sflat{(\sum r :: rs)}$ & $=$ & $\sum (\sflataux{r} @ rs)$\\
-		$\sflat{\sum []}$ & $ = $ & $ \sum []$\\
-		$\sflat r$ & $=$ & $ r$
+	\begin{tabular}{lcl}
+		$\sflat{(\sum r :: rs)}$ & $\dn$ & $\sum (\sflataux{r} @ rs)$\\
+		$\sflat{\sum []}$ & $ \dn $ & $ \sum []$\\
+		$\sflat r$ & $\dn$ & $ r$
+	\end{tabular}
+\end{center}
+
+\begin{center}  
+	\begin{tabular}{lcl}
+		$\sflataux{[]}'$ & $ \dn $ & $ []$\\
+		$\sflataux{ (r_1 + r_2) :: rs }'$ & $\dn$ & $r_1 :: r_2 :: rs$\\
+		$\sflataux{r :: rs}$ & $\dn$ & $ r::rs$
 	\end{tabular}
 \end{center}
 \noindent
@@ -1527,19 +1654,63 @@
 the output type a regular expression, not a list.
 $\sflataux{\_}$  and $\sflat{\_}$ are only recursive on the  
 first element of the list.
+$\sflataux{\_}'$ takes a list of regular expressions as input, and outputs
+a list of regular expressions.
+The use of $\sflataux{\_}$ and $\sflataux{\_}'$ is clear once we have
+$\textit{createdBySequence}$ defined:
+\begin{center}
+	\begin{mathpar}
+		\inferrule{\mbox{}}{\textit{createdBySequence}\; (r_1 \cdot r_2)}
 
-With $\sflataux{}$ a preliminary to the closed form can be stated,
-where the derivative of $r_1 \cdot r_2 \backslash s$ can be
-flattened into a list whose head and tail meet the description
-we gave earlier.
+		\inferrule{\textit{createdBySequence} \; r_1}{\textit{createdBySequence} \;
+		(r_1 + r_2)}
+	\end{mathpar}
+\end{center}
+\noindent
+The predicate $\textit{createdBySequence}$ is used to describe the shape of
+the derivative regular expressions $(r_1\cdot r_2) \backslash s$:
+\begin{lemma}\label{recursivelyDerseq}
+	It is always the case that
+	\begin{center}
+		$\textit{createdBySequence} \; ( (r_1\cdot r_2) \backslash_r s) $
+	\end{center}
+	holds.
+\end{lemma}
+\begin{proof}
+	By a reverse induction on the string $s$, where the inductive cases are $[]$
+	and $xs  @ [x]$.
+\end{proof}
+\noindent
+If we have a regular expression $r$ whose shpae 
+fits into those described by $\textit{createdBySequence}$,
+then we can convert between 
+$r \backslash_r c$ and $(\sflataux{r}) \backslash_r c$ with
+$\sflataux{\_}'$:
+\begin{lemma}\label{sfauIdemDer}
+	If $\textit{createdBySequence} \; r$, then 
+	\begin{center}
+		$\sflataux{ r \backslash_r c} = 
+		\llparenthesis (\map \; (\_ \backslash_r c) \; (\sflataux{r}) ) \rrparenthesis''$
+	\end{center}
+	holds.
+\end{lemma}
+\begin{proof}
+	By a simple induction on the inductive cases of $\textit{createdBySequence}.
+	$
+\end{proof}
+
+Now we are ready to express
+the shape of $r_1 \cdot r_2 \backslash s$
 \begin{lemma}\label{seqSfau0}
-	$\sflataux{\rders{(r_1 \cdot r_2) \backslash s }} = (r_1 \backslash_r s) \cdot r_2 
+	$\sflataux{(r_1 \cdot r_2) \backslash_r s} = (r_1 \backslash_r s) \cdot r_2 
 	:: (\map \; (r_2 \backslash_r \_) \; (\textit{Suffix} \; s \; r1))$ 
 \end{lemma}
 \begin{proof}
-	By an induction on the string $s$, where the inductive cases 
-	are split as $[]$ and $xs @ [x]$.
-	Note the key identify holds:
+	By a reverse induction on the string $s$, where the inductive cases 
+	are $[]$ and $xs @ [x]$.
+	For the inductive case, we know that $\textit{createdBySequence} \; ((r_1 \cdot r_2)
+	\backslash_r xs)$ holds from lemma \ref{recursivelyDerseq},
+	which can be used to prove
 	\[
 		\map \; (r_2 \backslash_r \_) \; (\vsuf{[x]}{(r_1 \backslash_r xs)}) \;\; @ \;\;
 		\map \; (\_ \backslash_r x) \; (\map \; (r_2 \backslash \_) \; (\vsuf{xs}{r_1}))
@@ -1548,96 +1719,126 @@
 	\[
 		\map \; (r_2 \backslash_r \_) \; (\vsuf{xs @ [x]}{r_1})
 	\]
-	This enables the inductive case to go through.
+	using lemma \ref{sfauIdemDer}.
+	This equality enables the inductive case to go through.
 \end{proof}
 \noindent 
-Note that this lemma does $\mathbf{not}$ depend on any
-specific definitions we used,
-allowing people investigating derivatives to get an alternative
-view of what $r_1 \cdot r_2$ is.
+This lemma says that $(r_1\cdot r_2)\backslash s$
+can be flattened into a list whose head and tail meet the description
+we gave earlier.
+%Note that this lemma does $\mathbf{not}$ depend on any
+%specific definitions we used,
+%allowing people investigating derivatives to get an alternative
+%view of what $r_1 \cdot r_2$ is.
 
-Now we are able to use this for the intuition that 
-the different ways in which regular expressions are 
-nested do not matter under $\rsimp{}$:
-\begin{center}
-	$\rsimp{r} \stackrel{?}{\sequal} \rsimp{r'}$ if $r = \sum [r_1, r_2, r_3, \ldots]$ 
-	and $r' =(\ldots ((r_1 + r_2) + r_3) + \ldots)$
-\end{center}
-Simply wrap with $\sum$ constructor and add 
-simplifications to both sides of \ref{seqSfau0}
-and one gets
-\begin{corollary}\label{seqClosedFormGeneral}
+We now use $\textit{createdBySequence}$ and
+$\sflataux{\_}$ to describe an intuition
+behind the closed form of the sequence closed form.
+If two regular expressions only differ in the way their
+alternatives are nested, then we should be able to get the same result
+once we apply simplification to both of them:
+\begin{lemma}\label{sflatRsimpeq}
+	If $r$ is created from a sequence through
+	a series of derivatives 
+	(i.e. if $\textit{createdBySequence} \; r$ holds), 
+	and that $\sflataux{r} = rs$,
+	then we have
+	that 
+	\begin{center}
+		$\textit{rsimp} \; r = \textit{rsimp} \; (\sum \; rs)$
+	\end{center}
+	holds.
+\end{lemma}
+\begin{proof}
+	By an induction on the inductive cases of $\textit{createdBySequence}$. 
+\end{proof}
+
+Now we are ready for the closed form 
+for the sequence regular expressions (without the inner applications
+of simplifications):
+\begin{lemma}\label{seqClosedFormGeneral}
 	$\rsimp{\sflat{(r_1 \cdot r_2) \backslash s} }
 	=\rsimp{(\sum (  (r_1 \backslash s) \cdot r_2 :: 
 	\map\; (r_2 \backslash \_) \; (\vsuf{s}{r_1})))}$
-\end{corollary}
-Together with the idempotency property of $\rsimp{}$ (lemma \ref{rsimpIdem}),
-it is possible to convert the above lemma to obtain a "closed form"
-for  derivatives nested with simplification:
-\begin{lemma}\label{seqClosedForm}
-	$\rderssimp{(r_1 \cdot r_2)}{s} = \rsimp{(\sum ((r_1 \backslash s) \cdot r_2 ) 
-	:: (\map \; (r_2 \backslash \_) (\vsuf{s}{r_1})))}$
 \end{lemma}
 \begin{proof}
-	By a case analysis of string $s$.
-	When $s$ is empty list, the rewrite is straightforward.
-	When $s$ is a list, one could use the corollary \ref{seqSfau0},
-	and lemma \ref{Simpders} to rewrite the left-hand-side.
+	We know that 
+	$\sflataux{(r_1 \cdot r_2) \backslash_r s} = (r_1 \backslash_r s) \cdot r_2 
+	:: (\map \; (r_2 \backslash_r \_) \; (\textit{Suffix} \; s \; r1))$
+	holds
+	by lemma \ref{seqSfau0}.
+	This allows the theorem to go through because of lemma \ref{sflatRsimpeq}.
 \end{proof}
-As a corollary for this closed form, one can estimate the size 
-of the sequence derivative $r_1 \cdot r_2 \backslash_r s$ using 
-an easier-to-handle expression:
-\begin{corollary}\label{seqEstimate1}
-	\begin{center}
-
-		$\llbracket \rderssimp{(r_1 \cdot r_2)}{s} \rrbracket_r = \llbracket \rsimp{(\sum ((r_1 \backslash s) \cdot r_2 ) 
-		:: (\map \; (r_2 \backslash \_) (\vsuf{s}{r_1})))} \rrbracket_r$
-
-	\end{center}
-\end{corollary}
-\noindent
+Together with the idempotency property of $\rsimp{}$ (lemma \ref{rsimpIdem}),
+it is possible to convert the above lemma to obtain the
+proper closed form for $\backslash_{rsimps}$ rather than $\backslash_r$:
+for  derivatives nested with simplification:
+\begin{theorem}\label{seqClosedForm}
+	$\rderssimp{(r_1 \cdot r_2)}{s} = \rsimp{(\sum ((r_1 \backslash s) \cdot r_2 ) 
+	:: (\map \; (r_2 \backslash \_) (\vsuf{s}{r_1})))}$
+\end{theorem}
+\begin{proof}
+	By a case analysis of the string $s$.
+	When $s$ is an empty list, the rewrite is straightforward.
+	When $s$ is a non-empty list, the
+	lemmas \ref{seqClosedFormGeneral} and \ref{Simpders} apply,
+	making the proof go through.
+\end{proof}
 \subsubsection{Closed Forms for Star Regular Expressions}
-We have shown how to control the size of the sequence regular expression $r_1\cdot r_2$ using
-the "closed form" of $(r_1 \cdot r_2) \backslash s$ and then 
-the property of the $\distinct$ function.
-Now we try to get a bound on $r^* \backslash s$ as well.
-Again, we first look at how a star's derivatives evolve, if they grow maximally: 
+The closed form for the star regular expression involves similar tricks
+for the sequence regular expression.
+The $\textit{Suffix}$ function is now replaced by something
+slightly more complex, because the growth pattern of star
+regular expressions' derivatives is a bit different:
 \begin{center}
-
-	$r^* \quad \longrightarrow_{\backslash c}  \quad   (r\backslash c)  \cdot  r^* \quad \longrightarrow_{\backslash c'}  \quad
-	r \backslash cc'  \cdot r^* + r \backslash c' \cdot r^*  \quad \longrightarrow_{\backslash c''} \quad 
-	(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*)   \quad \longrightarrow_{\backslash c'''}
-	\quad \ldots$
-
+	\begin{tabular}{lclc}
+		$r^* $ & $\longrightarrow_{\backslash c}$ & 
+		$(r\backslash c)  \cdot  r^*$ & $\longrightarrow_{\backslash c'}$\\
+		\\
+		$r \backslash cc'  \cdot r^* + r \backslash c' \cdot r^*$ &
+		$\longrightarrow_{\backslash c''}$ & 
+		$(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + 
+		(r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*)$ & 
+		$\longrightarrow_{\backslash c'''}$ \\
+		\\
+		$\ldots$\\
+	\end{tabular}
 \end{center}
 When we have a string $s = c :: c' :: c'' \ldots$  such that $r \backslash c$, $r \backslash cc'$, $r \backslash c'$, 
 $r \backslash cc'c''$, $r \backslash c'c''$, $r\backslash c''$ etc. are all nullable,
-the number of terms in $r^* \backslash s$ will grow exponentially, causing the size
-of the derivatives $r^* \backslash s$ to grow exponentially, even if we do not 
-count the possible size explosions of $r \backslash c$ themselves.
+the number of terms in $r^* \backslash s$ will grow exponentially rather than linearly
+in the sequence case.
+The good news is that the function $\textit{rsimp}$ will again
+ignore the difference between differently nesting patterns of alternatives,
+and the exponentially growing star derivative like
+\begin{center}
+	$(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + 
+	(r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) $ 
+\end{center}
+can be treated as
+\begin{center}
+	$\RALTS{[r_1 \backslash cc'c'' \cdot r^*, r \backslash c'', 
+	r \backslash c'c'' \cdot r^*, r \backslash c'' \cdot r^*]}$
+\end{center}
+which can be de-duplicated by $\rDistinct$
+and therefore bounded finitely.
 
-Thanks to $\rflts$ and $\rDistinct$, we are able to open up regular expressions like
-$(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + 
-(r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) $ 
-into $\RALTS{[r_1 \backslash cc'c'' \cdot r^*, r \backslash c'', 
-r \backslash c'c'' \cdot r^*, r \backslash c'' \cdot r^*]}$
-and then de-duplicate terms of the form $r\backslash s' \cdot r^*$ ($s'$ being a substring of $s$).
+and then de-duplicate terms of the form  ($s'$ being a substring of $s$).
 This allows us to use a similar technique as $r_1 \cdot r_2$ case,
-where the crux is to get an equivalent form of 
-$\rderssimp{r^*}{s}$ with shape $\rsimp{\sum rs}$.
-This requires generating 
-all possible sub-strings $s'$ of $s$
-such that $r\backslash s' \cdot r^*$ will appear 
-as a term in $(r^*) \backslash s$.
-The first function we define is a single-step
-updating function $\starupdate$, which takes three arguments as input:
-the new character $c$ to take derivative with, 
-the regular expression
-$r$ directly under the star $r^*$, and the
-list of strings $sSet$ for the derivative $r^* \backslash s$ 
-up til this point  
-such that $(r^*) \backslash s = \sum_{s' \in sSet} (r\backslash s') \cdot r^*$ 
-(the equality is not exact, more on this later).
+
+Now the crux of this section is finding a suitable description
+for $rs$ where
+\begin{center}
+	$\rderssimp{r^*}{s} = \rsimp{\sum rs}$.
+\end{center}
+holds.
+In addition, the list $rs$
+shall be in the form of 
+$\map \; (\lambda s'. r\backslash s' \cdot r^*) \; Ss$.
+The $Ss$ is a list of strings, and for example in the sequence
+closed form it is specified as $\textit{Suffix} \; s \; r_1$.
+To get $Ss$ for the star regular expression,
+we need to introduce $\starupdate$ and $\starupdates$:
 \begin{center}
 	\begin{tabular}{lcl}
 		$\starupdate \; c \; r \; [] $ & $\dn$ & $[]$\\
@@ -1650,11 +1851,6 @@
 						     \starupdate \; c \; r \; Ss)$
 	\end{tabular}
 \end{center}
-\noindent
-As a generalisation from characters to strings,
-$\starupdates$ takes a string instead of a character
-as the first input argument, and is otherwise the same
-as $\starupdate$.
 \begin{center}
 	\begin{tabular}{lcl}
 		$\starupdates \; [] \; r \; Ss$ & $=$ & $Ss$\\
@@ -1663,12 +1859,33 @@
 	\end{tabular}
 \end{center}
 \noindent
-For the star regular expression,
-its derivatives can be seen as  a nested gigantic
-alternative similar to that of sequence regular expression's derivatives, 
-and therefore need
-to be ``straightened out" as well.
-The function for this would be $\hflat{}$ and $\hflataux{}$.
+Assuming we have that
+\begin{center}
+	$\rderssimp{r^*}{s} = \rsimp{(\sum \map \; (\lambda s'. r\backslash s' \cdot r^*) \; Ss)}$.
+\end{center}
+holds.
+The idea of $\starupdate$ and $\starupdates$
+is to update $Ss$ when another
+derivative is taken on $\rderssimp{r^*}{s}$
+w.r.t a character $c$ and a string $s'$
+respectively.
+Both $\starupdate$ and $\starupdates$ take three arguments as input:
+the new character $c$ or string $s$ to take derivative with, 
+the regular expression
+$r$ under the star $r^*$, and the
+list of strings $Ss$ for the derivative $r^* \backslash s$ 
+up until this point  
+such that 
+\begin{center}
+$(r^*) \backslash s = \sum_{s' \in sSet} (r\backslash s') \cdot r^*$ 
+\end{center}
+is satisfied.
+
+Functions $\starupdate$ and $\starupdates$ characterise what the 
+star derivatives will look like once ``straightened out'' into lists.
+The helper functions for such operations will be similar to
+$\sflat{\_}$, $\sflataux{\_}$ and $\sflataux{\_}$, which we defined for sequence.
+We use similar symbols to denote them, with a $*$ subscript to mark the difference.
 \begin{center}
 	\begin{tabular}{lcl}
 		$\hflataux{r_1 + r_2}$ & $\dn$ & $\hflataux{r_1} @ \hflataux{r_2}$\\
@@ -1683,122 +1900,230 @@
 	\end{tabular}
 \end{center}
 \noindent
-%MAYBE TODO: introduce createdByStar
-Again these definitions are tailor-made for dealing with alternatives that have
-originated from a star's derivatives, so we do not attempt to open up all possible 
-regular expressions of the form $\RALTS{rs}$, where $\textit{rs}$ might not contain precisely 2
-elements.
-We give a predicate for such "star-created" regular expressions:
+These definitions are tailor-made for dealing with alternatives that have
+originated from a star's derivatives.
+A typical star derivative always has the structure of a balanced binary tree:
 \begin{center}
-	\begin{tabular}{lcr}
-	 &    &       $\createdByStar{(\RSEQ{ra}{\RSTAR{rb}}) }$\\
-		$\createdByStar{r_1} \land \createdByStar{r_2} $ & $ \Longrightarrow$ & $\createdByStar{(r_1 + r_2)}$
-	\end{tabular}
+	$(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + 
+	(r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) $ 
 \end{center}
+All of the nested structures of alternatives
+generated from derivatives are binary, and therefore
+$\hflat{\_}$ and $\hflataux{\_}$ only deal with binary alternatives.
+$\hflat{\_}$ ``untangles'' like the following:
+\[
+	\sum ((r_1 + r_2) + (r_3 + r_4))  + \ldots \;
+	\stackrel{\hflat{\_}}{\longrightarrow} \;
+	\RALTS{[r_1, r_2, \ldots, r_n]} 
+\]
+Here is a lemma stating the recursive property of $\starupdate$ and $\starupdates$,
+with the helpers $\hflat{\_}$ and $\hflataux{\_}$\footnote{The function $\textit{concat}$ takes a list of lists 
+			and merges each of the element lists to form a flattened list.}:
+\begin{lemma}\label{stupdateInduct1}
+	\mbox
+	For a list of strings $Ss$, the following hold.
+	\begin{itemize}
+		\item
+			If we do a derivative on the terms 
+			$r\backslash_r s \cdot r^*$ (where $s$ is taken from the list $Ss$),
+			the result will be the same as if we apply $\starupdate$ to $Ss$.
+			\begin{center}
+				\begin{tabular}{c}
+			$\textit{concat} \; (\map \; (\hflataux{\_} \circ ( (\_\backslash_r x)
+			\circ (\lambda s.\;\; (r \backslash_r s) \cdot r^*)))\; Ss )\;
+			$\\
+			\\
+			$=$ \\
+			\\
+			$\map \; (\lambda s. (r \backslash_r s) \cdot (r^*)) \; 
+			(\starupdate \; x \; r \; Ss)$
+				\end{tabular}
+			\end{center}
+		\item
+			$\starupdates$ is ``composable'' w.r.t a derivative.
+			It piggybacks the character $x$ to the tail of the string
+			$xs$.
+			\begin{center}
+				\begin{tabular}{c}
+					$\textit{concat} \; (\map \; \hflataux{\_} \; 
+					(\map \; (\_\backslash_r x) \; 
+					(\map \; (\lambda s.\;\; (r \backslash_r s) \cdot 
+					(r^*) ) \; (\starupdates \; xs \; r \; Ss))))$\\
+					\\
+					$=$\\
+					\\
+					$\map \; (\lambda s.\;\; (r\backslash_r s) \cdot (r^*)) \;
+					(\starupdates \; (xs @ [x]) \; r \; Ss)$
+				\end{tabular}
+			\end{center}
+	\end{itemize}
+\end{lemma}
+			
+\begin{proof}
+	Part 1 is by induction on $Ss$.
+	Part 2 is by induction on $xs$, where $Ss$ is left to take arbitrary values.
+\end{proof}
+			
 
-These definitions allows us the flexibility to talk about 
-regular expressions in their most convenient format,
-for example, flattened out $\RALTS{[r_1, r_2, \ldots, r_n]} $
-instead of binary-nested: $((r_1 + r_2) + (r_3 + r_4)) + \ldots$.
-These definitions help express that certain classes of syntatically 
-distinct regular expressions are actually the same under simplification.
-This is not entirely true for annotated regular expressions: 
-%TODO: bsimp bders \neq bderssimp
+Like $\textit{createdBySequence}$, we need
+a predicate for ``star-created'' regular expressions:
 \begin{center}
-	$(1+ (c\cdot \ASEQ{bs}{c^*}{c} ))$
-\end{center}
-For bit-codes, the order in which simplification is applied
-might cause a difference in the location they are placed.
-If we want something like
-\begin{center}
-	$\bderssimp{r}{s} \myequiv \bsimp{\bders{r}{s}}$
+	\begin{mathpar}
+		\inferrule{\mbox{}}{ \textit{createdByStar}\; \RSEQ{ra}{\RSTAR{rb}} }
+
+		\inferrule{  \textit{createdByStar} \; r_1\; \land  \; \textit{createdByStar} \; r_2 }{\textit{createdByStar} \; (r_1 + r_2) } 
+	\end{mathpar}
 \end{center}
-Some "canonicalization" procedure is required,
-which either pushes all the common bitcodes to nodes
-as senior as possible:
-\begin{center}
-	$_{bs}(_{bs_1 @ bs'}r_1 + _{bs_1 @ bs''}r_2) \rightarrow _{bs @ bs_1}(_{bs'}r_1 + _{bs''}r_2) $
-\end{center}
-or does the reverse. However bitcodes are not of interest if we are talking about
-the $\llbracket r \rrbracket$ size of a regex.
-Therefore for the ease and simplicity of producing a
-proof for a size bound, we are happy to restrict ourselves to 
-unannotated regular expressions, and obtain such equalities as
-\begin{lemma}
-	$\rsimp{r_1 + r_2} = \rsimp{\RALTS{\hflataux{r_1} @ \hflataux{r_2}}}$
+\noindent
+All regular expressions created by taking derivatives of
+$r_1 \cdot (r_2)^*$ satisfy the $\textit{createdByStar}$ predicate:
+\begin{lemma}\label{starDersCbs}
+	$\textit{createdByStar} \; ((r_1 \cdot r_2^*) \backslash_r s) $ holds.
+\end{lemma}
+\begin{proof}
+	By a reverse induction on $s$.
+\end{proof}
+If a regular expression conforms to the shape of a star's derivative,
+then we can push an application of $\hflataux{\_}$ inside a derivative of it:
+\begin{lemma}\label{hfauPushin}
+	If $\textit{createdByStar} \; r$ holds, then
+	\begin{center}
+		$\hflataux{r \backslash_r c} = \textit{concat} \; (
+		\map \; \hflataux{\_} (\map \; (\_\backslash_r c) \;(\hflataux{r})))$
+	\end{center}
+	holds.
+\end{lemma}
+\begin{proof}
+	By an induction on the inductivev cases of $\textit{createdByStar}$.
+\end{proof}
+%This is not entirely true for annotated regular expressions: 
+%%TODO: bsimp bders \neq bderssimp
+%\begin{center}
+%	$(1+ (c\cdot \ASEQ{bs}{c^*}{c} ))$
+%\end{center}
+%For bit-codes, the order in which simplification is applied
+%might cause a difference in the location they are placed.
+%If we want something like
+%\begin{center}
+%	$\bderssimp{r}{s} \myequiv \bsimp{\bders{r}{s}}$
+%\end{center}
+%Some "canonicalization" procedure is required,
+%which either pushes all the common bitcodes to nodes
+%as senior as possible:
+%\begin{center}
+%	$_{bs}(_{bs_1 @ bs'}r_1 + _{bs_1 @ bs''}r_2) \rightarrow _{bs @ bs_1}(_{bs'}r_1 + _{bs''}r_2) $
+%\end{center}
+%or does the reverse. However bitcodes are not of interest if we are talking about
+%the $\llbracket r \rrbracket$ size of a regex.
+%Therefore for the ease and simplicity of producing a
+%proof for a size bound, we are happy to restrict ourselves to 
+%unannotated regular expressions, and obtain such equalities as
+%TODO: rsimp sflat
+% The simplification of a flattened out regular expression, provided it comes
+%from the derivative of a star, is the same as the one nested.
+
+
+
+Now we introduce an inductive property
+for $\starupdate$ and $\hflataux{\_}$.
+\begin{lemma}\label{starHfauInduct}
+	If we do derivatives of $r^*$
+	with a string that starts with $c$,
+	then flatten it out,
+	we obtain a list
+	of the shape $\sum_{s' \in sS} (r\backslash_r s') \cdot r^*$,
+	where $sS = \starupdates \; s \; r \; [[c]]$. Namely,
+	\begin{center}
+		$\hflataux{(\rders{( (\rder{c}{r_0})\cdot(r_0^*))}{s})} = 
+		\map \; (\lambda s_1. (r_0 \backslash_r s_1) \cdot (r_0^*)) \; 
+		(\starupdates \; s \; r_0 \; [[c]])$
+	\end{center}
+holds.
+\end{lemma}
+\begin{proof}
+	By an induction on $s$, the inductive cases
+	being $[]$ and $s@[c]$. The lemmas \ref{hfauPushin} and \ref{starDersCbs} are used.
+\end{proof}
+\noindent
+
+$\hflataux{\_}$ has a similar effect as $\textit{flatten}$:
+\begin{lemma}\label{hflatauxGrewrites}
+	$a :: rs \grewrites \hflataux{a} @ rs$
+\end{lemma}
+\begin{proof}
+	By induction on $a$. $rs$ is set to take arbitrary values.
+\end{proof}
+It is also not surprising that $\textit{rsimp}$ will cover
+the effect of $\hflataux{\_}$:
+\begin{lemma}\label{cbsHfauRsimpeq1}
+	$\rsimp{(r_1 + r_2)} = \rsimp{(\RALTS{\hflataux{r_1} @ \hflataux{r_2}})}$
 \end{lemma}
 
 \begin{proof}
 	By using the rewriting relation $\rightsquigarrow$
 \end{proof}
-%TODO: rsimp sflat
 And from this we obtain a proof that a star's derivative will be the same
 as if it had all its nested alternatives created during deriving being flattened out:
 For example,
-\begin{lemma}
+\begin{lemma}\label{hfauRsimpeq2}
 	$\createdByStar{r} \implies \rsimp{r} = \rsimp{\RALTS{\hflataux{r}}}$
 \end{lemma}
 \begin{proof}
-	By structural induction on $r$, where the induction rules are these of $\createdByStar{_}$.
+	By structural induction on $r$, where the induction rules 
+	are these of $\createdByStar{\_}$.
+	Lemma \ref{cbsHfauRsimpeq1} is used in the inductive case.
 \end{proof}
-% The simplification of a flattened out regular expression, provided it comes
-%from the derivative of a star, is the same as the one nested.
-
 
 
-We first introduce an inductive property
-for $\starupdate$ and $\hflataux{\_}$, 
-it says if we do derivatives of $r^*$
-with a string that starts with $c$,
-then flatten it out,
-we obtain a list
-of the shape $\sum_{s' \in sSet} (r\backslash_r s') \cdot r^*$,
-where $sSet = \starupdates \; s \; r \; [[c]]$.
-\begin{lemma}\label{starHfauInduct}
-	$\hflataux{(\rders{( (\rder{c}{r_0})\cdot(r_0^*))}{s})} = 
-	\map \; (\lambda s_1. (r_0 \backslash_r s_1) \cdot (r_0^*)) \; 
-	(\starupdates \; s \; r_0 \; [[c]])$
+%Here is a corollary that states the lemma in
+%a more intuitive way:
+%\begin{corollary}
+%	$\hflataux{r^* \backslash_r (c::xs)} = \map \; (\lambda s. (r \backslash_r s) \cdot
+%	(r^*))\; (\starupdates \; c\; r\; [[c]])$
+%\end{corollary}
+%\noindent
+%Note that this is also agnostic of the simplification
+%function we defined, and is therefore of more general interest.
+
+Together with the rewriting relation
+\begin{lemma}\label{starClosedForm6Hrewrites}
+	We have the following set of rewriting relations or equalities:
+	\begin{itemize}
+		\item
+			$\textit{rsimp} \; (r^* \backslash_r (c::s)) 
+			\sequal
+			\sum \; ( ( \sum (\lambda s. (r\backslash_r s) \cdot r^*) \; (
+			\starupdates \; s \; r \; [ c::[]] ) ) )$
+		\item
+			$r \backslash_{rsimp} (c::s) = \textit{rsimp} \; ( (
+			\sum ( (\map \; (\lambda s_1. (r\backslash s_1) \; r^*) \;
+			(\starupdates \;s \; r \; [ c::[] ])))))$
+		\item
+			$\sum ( (\map \; (\lambda s. (r\backslash s) \; r^*) \; Ss))
+			\sequal
+			 \sum ( (\map \; (\lambda s. \textit{rsimp} \; (r\backslash s) \;
+			 r^*) \; Ss) )$
+		\item
+			$\map \; (\lambda s. (\rsimp{r \backslash_r s}) \cdot (r^*)) \; Ss
+			\scfrewrites
+			\map \; (\lambda s. (\rsimp{r \backslash_r s}) \cdot (r^*)) \; Ss$
+		\item
+			$( ( \sum ( ( \map \ (\lambda s. \;\;
+			(\textit{rsimp} \; (r \backslash_r s)) \cdot r^*) \; (\starupdates \;
+			s \; r \; [ c::[] ])))))$\\
+			$\sequal$\\
+			$( ( \sum ( ( \map \ (\lambda s. \;\;
+			( r \backslash_{rsimp} s)) \cdot r^*) \; (\starupdates \;
+			s \; r \; [ c::[] ]))))$\\
+	\end{itemize}
 \end{lemma}
 \begin{proof}
-	By an induction on $s$, the inductive cases
-	being $[]$ and $s@[c]$.
+	Part 1 leads to part 2.
+	The rest of them are routine.
 \end{proof}
 \noindent
-Here is a corollary that states the lemma in
-a more intuitive way:
-\begin{corollary}
-	$\hflataux{r^* \backslash_r (c::xs)} = \map \; (\lambda s. (r \backslash_r s) \cdot
-	(r^*))\; (\starupdates \; c\; r\; [[c]])$
-\end{corollary}
-\noindent
-Note that this is also agnostic of the simplification
-function we defined, and is therefore of more general interest.
-
-Now adding the $\rsimp{}$ bit for closed forms,
-we have
-\begin{lemma}
-	$a :: rs \grewrites \hflataux{a} @ rs$
-\end{lemma}
-\noindent
-giving us
-\begin{lemma}\label{cbsHfauRsimpeq1}
-	$\rsimp{a+b} = \rsimp{(\sum \hflataux{a} @ \hflataux{b})}$.
-\end{lemma}
-\noindent
-This yields
-\begin{lemma}\label{hfauRsimpeq2}
-	$\rsimp{r} = \rsimp{(\sum \hflataux{r})}$
-\end{lemma}
-\noindent
-Together with the rewriting relation
-\begin{lemma}\label{starClosedForm6Hrewrites}
-	$\map \; (\lambda s. (\rsimp{r \backslash_r s}) \cdot (r^*)) \; Ss
-	\scfrewrites
-	\map \; (\lambda s. (\rsimp{r \backslash_r s}) \cdot (r^*)) \; Ss$
-\end{lemma}
-\noindent
-We obtain the closed form for star regular expression:
-\begin{lemma}\label{starClosedForm}
+Next the closed form for star regular expressions can be derived:
+\begin{theorem}\label{starClosedForm}
 	$\rderssimp{r^*}{c::s} = 
 	\rsimp{
 		(\sum (\map \; (\lambda s. (\rderssimp{r}{s})\cdot r^*) \; 
@@ -1807,11 +2132,14 @@
 		)
 	}
 	$
-\end{lemma}
+\end{theorem}
 \begin{proof}
 	By an induction on $s$.
-	The lemmas \ref{rsimpIdem}, \ref{starHfauInduct}, and \ref{hfauRsimpeq2}
+	The lemmas \ref{rsimpIdem}, \ref{starHfauInduct}, \ref{starClosedForm6Hrewrites} 
+	and \ref{hfauRsimpeq2}
 	are used.	
+	In \ref{starClosedForm6Hrewrites}, the equalities are
+	used to link the LHS and RHS.
 \end{proof}
 
 
@@ -1881,13 +2209,56 @@
 
 In this section, we introduce how we formalised the bound
 on closed forms.
-We first prove that functions such as $\rflts$
+We first show that in general regular expressions up to a certain 
+size are finite.
+Then we prove that functions such as $\rflts$
 will not cause the size of r-regular expressions to grow.
 Putting this together with a general bound 
 on the finiteness of distinct regular expressions
-smaller than a certain size, we obtain a bound on 
+up to a certain size, we obtain a bound on 
 the closed forms.
 
+\subsection{Finiteness of Distinct Regular Expressions}
+We define the set of regular expressions whose size are no more than
+a certain size $N$ as $\textit{sizeNregex} \; N$:
+\[
+	\textit{sizeNregex} \; N \dn \{r\; \mid \;  \llbracket r \rrbracket_r \leq N \}
+\]
+We have that $\textit{sizeNregex} \; N$ is always a finite set:
+\begin{lemma}\label{finiteSizeN}
+	$\textit{finite} \; (\textit{sizeNregex} \; N)$ holds.
+\end{lemma}
+\begin{proof}
+	By splitting the set $\textit{sizeNregex} \; (N + 1)$ into
+	subsets by their categories:
+	$\{\ZERO, \ONE, c\}$, $\{* `` \textit{sizeNregex} \; N\}$,
+	and so on. Each of these subsets are finitely bounded.
+\end{proof}
+\noindent
+From this we get a corollary that
+if forall $r \in rs$, $\rsize{r} \leq N$, then the output of 
+$\rdistinct{rs}{\varnothing}$ is a list of regular
+expressions of finite size depending on $N$ only. 
+\begin{corollary}\label{finiteSizeNCorollary}
+	$\rsize{\rdistinct{rs}{\varnothing}} \leq c_N * N$ holds,
+	where the constant $c_N$ is equal to $\textit{card} \; (\textit{sizeNregex} \;
+	N)$.
+\end{corollary}
+\begin{proof}
+	For all $r$ in 
+	$\textit{set} \; (\rdistinct{rs}{\varnothing})$,
+	it is always the case that $\rsize{r} \leq N$.
+	In addition, the list length is bounded by
+	$c_N$, yielding the desired bound.
+\end{proof}
+\noindent
+This fact will be handy in estimating the closed form sizes.
+%We have proven that the size of the
+%output of $\textit{rdistinct} \; rs' \; \varnothing$
+%is bounded by a constant $N * c_N$ depending only on $N$,
+%provided that each of $rs'$'s element
+%is bounded by $N$.
+
 \subsection{$\textit{rsimp}$ Does Not Increment the Size}
 Although it seems evident, we need a series
 of non-trivial lemmas to establish that functions such as $\rflts$
@@ -1944,7 +2315,8 @@
 \end{proof}
 
 \subsection{Estimating the Closed Forms' sizes}
-We now summarize the closed forms below:
+We recap the closed forms we obtained
+earlier by putting them together down below:
 \begin{itemize}
 	\item
 		$\rderssimp{(\sum rs)}{s} \sequal
@@ -1978,44 +2350,10 @@
 
 We elaborate the above reasoning by a series of lemmas
 below, where straightforward proofs are omitted.
-\begin{lemma}
-	If $\forall r \in rs. \rsize{r} $ is less than or equal to $N$,
-	and $\textit{length} \; rs$ is less than or equal to $l$,
-	then $\rsize{\sum rs}$ is less than or equal to $l*N + 1$.
-\end{lemma}
-\noindent
-If we define all regular expressions with size no
-more than $N$ as $\sizeNregex \; N$:
-\[
-	\sizeNregex \; N \dn  \{r \mid \rsize{r} \leq N \}
-\]
-Then such set is finite:
-\begin{lemma}\label{finiteSizeN}
-	$\textit{isFinite}\; (\sizeNregex \; N)$
-\end{lemma}
-\begin{proof}
-	By overestimating the set $\sizeNregex \; N + 1$
-	using union of sets like
-	$\{r_1 \cdot r_2 \mid r_1 \in A
-		\text{and}
-	r_2 \in A\}
-	$ where $A = \sizeNregex \; N$.
-\end{proof}
-\noindent
-From this we get a corollary that
-if forall $r \in rs$, $\rsize{r} \leq N$, then the output of 
-$\rdistinct{rs}{\varnothing}$ is a list of regular
-expressions of finite size depending on $N$ only. 
-\begin{corollary}\label{finiteSizeNCorollary}
-	Assumes that for all $r \in rs. \rsize{r} \leq N$,
-	and the cardinality of $\sizeNregex \; N$ is $c_N$
-	then$\rsize{\rdistinct{rs}{\varnothing}} \leq c*N$.
-\end{corollary}
-\noindent
-We have proven that the output of $\rdistinct{rs'}{\varnothing}$
-is bounded by a constant $c_N$ depending only on $N$,
-provided that each of $rs'$'s element
-is bounded by $N$.
+
+
+
+
 We want to apply it to our setting $\rsize{\rsimp{\sum rs}}$.
 
 We show that $\rdistinct$ and $\rflts$
@@ -2041,20 +2379,20 @@
 From
 \begin{center}
 $\llbracket  \textit{rflts}\; rs \rrbracket_r \leq
-	\llbracket \; \textit{rs} \rrbracket_r$
+	\llbracket  \textit{rs} \rrbracket_r$
 \end{center}
 and
 \begin{center}
      $\llbracket  \textit{rdistinct} \; rs \; \varnothing \leq
      \llbracket rs \rrbracket_r$
 \end{center}
-one cannot imply
+one cannot infer 
 \begin{center}
 	$\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r 
 	\leq 
 	\llbracket \rdistinct{rs}{\varnothing}  \rrbracket_r  $.
 \end{center}
-What we can imply is that 
+What we can infer is that 
 \begin{center}
 	$\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r 
 	\leq
@@ -2062,7 +2400,7 @@
 \end{center}
 but this estimate is too rough and $\llbracket rs \rrbracket_r$	is unbounded.
 The way we 
-get through this is by first proving a more general lemma 
+get around this is by first proving a more general lemma 
 (so that the inductive case goes through):
 \begin{lemma}\label{fltsSizeReductionAlts}
 	If we have three accumulator sets:
@@ -2087,11 +2425,17 @@
 		holds.
 \end{lemma}
 \noindent
-We need to split the accumulator into two parts: the part
+We split the accumulator into two parts: the part
 which contains alternative regular expressions ($alts\_set$), and 
 the part without any of them($noalts\_set$).
+This is because $\rflts$ opens up the alternatives in $as$,
+causing the accumulators on both sides of the inequality
+to diverge slightly.
+If we want to compare the accumulators that are not
+perfectly in sync, we need to consider the alternatives and non-alternatives
+separately.
 The set $corr\_set$ is the corresponding set
-of $alts\_set$ with all elements under the $\sum$ constructor
+of $alts\_set$ with all elements under the alternative constructor
 spilled out.
 \begin{proof}
 	By induction on the list $as$. We make use of lemma \ref{rdistinctConcat}.
@@ -2106,7 +2450,8 @@
 	By using the lemma \ref{fltsSizeReductionAlts}.
 \end{proof}
 \noindent
-The intuition is that if we remove duplicates from the $\textit{LHS}$, at least the same amount of 
+The intuition for why this is true
+is that if we remove duplicates from the $\textit{LHS}$, at least the same amount of 
 duplicates will be removed from the list $\textit{rs}$ in the $\textit{RHS}$. 
 
 Now this $\rsimp{\sum rs}$ can be estimated using $\rdistinct{rs}{\varnothing}$:
@@ -2114,14 +2459,14 @@
 	$\rsize{\rsimp{\sum rs}} \leq \rsize{\rdistinct{rs}{\varnothing}}+ 1$
 \end{lemma}
 \begin{proof}
-	By using \ref{interactionFltsDB}.
+	By using corollary \ref{interactionFltsDB}.
 \end{proof}
 \noindent
 This is a key lemma in establishing the bounds on all the 
 closed forms.
 With this we are now ready to control the sizes of
-$(r_1 \cdot r_2 )\backslash s$, $r^* \backslash s$.
-\begin{theorem}
+$(r_1 \cdot r_2 )\backslash s$ and $r^* \backslash s$.
+\begin{theorem}\label{rBound}
 	For any regex $r$, $\exists N_r. \forall s. \; \rsize{\rderssimp{r}{s}} \leq N_r$
 \end{theorem}
 \noindent
@@ -2146,7 +2491,7 @@
 \end{tabular}
 \end{center}
 \noindent
-(1) is by the corollary \ref{seqEstimate1}.
+(1) is by theorem \ref{seqClosedForm}.
 (2) is by \ref{altsSimpControl}.
 (3) is by \ref{finiteSizeNCorollary}.
 
@@ -2183,9 +2528,9 @@
 \end{tabular}
 \end{center}
 \noindent
-(5) is by the lemma  \ref{starClosedForm}.
+(5) is by theorem \ref{starClosedForm}.
 (6) is by \ref{altsSimpControl}.
-(7) is by \ref{finiteSizeNCorollary}.
+(7) is by corollary \ref{finiteSizeNCorollary}.
 Combining with the case when $s = []$, one gets
 \begin{center}
 	\begin{tabular}{lcll}
@@ -2208,7 +2553,7 @@
 	\end{tabular}
 \end{center}
 \noindent
-(9) is by \ref{altsClosedForm}, (10) by \ref{rsimpSize} and (11) by inductive hypothesis.
+(9) is by theorem \ref{altsClosedForm}, (10) by lemma \ref{rsimpMono} and (11) by inductive hypothesis.
 
 Combining with the case when $s = []$, one gets
 \begin{center}
@@ -2217,15 +2562,15 @@
 						 & (12)\\
 	\end{tabular}
 \end{center}
-(4), (8), and (12) are all the inductive cases proven.
+We have all the inductive cases proven.
 \end{proof}
 
-
+This leads to our main result on the size bound:
 \begin{corollary}
-	For any regex $a$, $\exists N_r. \forall s. \; \rsize{\bderssimp{a}{s}} \leq N_r$
+	For any annotated regular expression $a$, $\exists N_r. \forall s. \; \rsize{\bderssimp{a}{s}} \leq N_r$
 \end{corollary}
 \begin{proof}
-	By \ref{sizeRelations}.
+	By lemma \ref{sizeRelations} and theorem \ref{rBound}.
 \end{proof}
 \noindent
 
@@ -2243,22 +2588,9 @@
 %----------------------------------------------------------------------------------------
 
 
-\subsection{A Closed Form for the Sequence Regular Expression}
-\noindent
-
-Before we get to the proof that says the intermediate result of our lexer will
-remain finitely bounded, which is an important efficiency/liveness guarantee,
-we shall first develop a few preparatory properties and definitions to 
-make the process of proving that a breeze.
-
-We define rewriting relations for $\rrexp$s, which allows us to do the 
-same trick as we did for the correctness proof,
-but this time we will have stronger equalities established.
-
-
-
+\section{Comments and Future Improvements}
+\subsection{Some Experimental Results}
 What guarantee does this bound give us?
-
 Whatever the regex is, it will not grow indefinitely.
 Take our previous example $(a + aa)^*$ as an example:
 \begin{center}
@@ -2278,7 +2610,7 @@
 				width=5cm,
 				height=4cm, 
 				legend entries={$(a + aa)^*$},  
-				legend pos=north west,
+				legend pos=south east,
 				legend cell align=left]
 				\addplot[red,mark=*, mark options={fill=white}] table {a_aa_star.data};
 			\end{axis}
@@ -2297,22 +2629,16 @@
 $f(x) = x * 2^x$.
 This means the bound we have will surge up at least
 tower-exponentially with a linear increase of the depth.
-For a regex of depth $n$, the bound
-would be approximately $4^n$.
 
-Test data in the graphs from randomly generated regular expressions
-shows that the giant bounds are far from being hit.
-%a few sample regular experessions' derivatives
-%size change
-%TODO: giving regex1_size_change.data showing a few regular expressions' size changes 
-%w;r;t the input characters number, where the size is usually cubic in terms of original size
-%a*, aa*, aaa*, .....
-%randomly generated regular expressions
-\begin{figure}{H}
-	\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
+One might be quite skeptical about what this non-elementary
+bound can bring us.
+It turns out that the giant bounds are far from being hit.
+Here we have some test data from randomly generated regular expressions:
+\begin{figure}[H]
+	\begin{tabular}{@{}c@{\hspace{2mm}}c@{\hspace{0mm}}c@{}}
 		\begin{tikzpicture}
 			\begin{axis}[
-				xlabel={number of characters},
+				xlabel={$n$},
 				x label style={at={(1.05,-0.05)}},
 				ylabel={regex size},
 				enlargelimits=false,
@@ -2322,15 +2648,15 @@
 				%ytick={0,100,...,1000},
 				scaled ticks=false,
 				axis lines=left,
-				width=5cm,
-				height=4cm, 
+				width=4.75cm,
+				height=3.8cm, 
 				legend entries={regex1},  
-				legend pos=north west,
+				legend pos=north east,
 				legend cell align=left]
 				\addplot[red,mark=*, mark options={fill=white}] table {regex1_size_change.data};
 			\end{axis}
 		\end{tikzpicture}
-  &
+ & 
   \begin{tikzpicture}
 	  \begin{axis}[
 		  xlabel={$n$},
@@ -2343,15 +2669,15 @@
 		  %ytick={0,100,...,1000},
 		  scaled ticks=false,
 		  axis lines=left,
-		  width=5cm,
-		  height=4cm, 
+		  width=4.75cm,
+		  height=3.8cm, 
 		  legend entries={regex2},  
-		  legend pos=north west,
+		  legend pos=south east,
 		  legend cell align=left]
 		  \addplot[blue,mark=*, mark options={fill=white}] table {regex2_size_change.data};
 	  \end{axis}
   \end{tikzpicture}
-  &
+ & 
   \begin{tikzpicture}
 	  \begin{axis}[
 		  xlabel={$n$},
@@ -2364,16 +2690,19 @@
 		  %ytick={0,100,...,1000},
 		  scaled ticks=false,
 		  axis lines=left,
-		  width=5cm,
-		  height=4cm, 
+		  width=4.75cm,
+		  height=3.8cm, 
 		  legend entries={regex3},  
-		  legend pos=north west,
+		  legend pos=south east,
 		  legend cell align=left]
 		  \addplot[cyan,mark=*, mark options={fill=white}] table {regex3_size_change.data};
 	  \end{axis}
   \end{tikzpicture}\\
-  \multicolumn{3}{c}{Graphs: size change of 3 randomly generated regular expressions $w.r.t.$ input string length.}
+  \multicolumn{3}{c}{}
 	\end{tabular}    
+  \caption{Graphs: size change of 3 randomly generated 
+  regular expressions $w.r.t.$ input string length. 
+  The x axis represents the length of input.}
 \end{figure}  
 \noindent
 Most of the regex's sizes seem to stay within a polynomial bound $w.r.t$ the 
@@ -2382,24 +2711,22 @@
 
 
 
-\section{Possible Further Improvements}
-There are two problems with this finiteness result, though.
-\begin{itemize}
-	\item
-		First, It is not yet a direct formalisation of our lexer's complexity,
+\subsection{Possible Further Improvements}
+There are two problems with this finiteness result, though.\\
+(i)	
+		First, it is not yet a direct formalisation of our lexer's complexity,
 		as a complexity proof would require looking into 
 		the time it takes to execute {\bf all} the operations
-		involved in the lexer (simp, collect, decode), not just the derivative.
-	\item
+		involved in the lexer (simp, collect, decode), not just the derivative.\\
+(ii)
 		Second, the bound is not yet tight, and we seek to improve $N_a$ so that
-		it is polynomial on $\llbracket a \rrbracket$.
-\end{itemize}
-Still, we believe this contribution is fruitful,
+		it is polynomial on $\llbracket a \rrbracket$.\\
+Still, we believe this contribution is useful,
 because
 \begin{itemize}
 	\item
 
-		The size proof can serve as a cornerstone for a complexity
+		The size proof can serve as a starting point for a complexity
 		formalisation.
 		Derivatives are the most important phases of our lexer algorithm.
 		Size properties about derivatives covers the majority of the algorithm
@@ -2435,8 +2762,11 @@
 would still be slow.
 And unfortunately, we have concrete examples
 where such regular expressions grew exponentially large before levelling off:
+\begin{center}
 $(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots + 
-(\underbrace{a \ldots a}_{\text{n a's}})^*$ will already have a maximum
+(\underbrace{a \ldots a}_{\text{n a's}})^*)^*$ 
+\end{center}
+will already have a maximum
 size that is  exponential on the number $n$ 
 under our current simplification rules:
 %TODO: graph of a regex whose size increases exponentially.
@@ -2456,38 +2786,48 @@
 	\end{tikzpicture}
 \end{center}
 
-For convenience we use $(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$
+For convenience we use $(\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$
 to express $(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots + 
 (\underbrace{a \ldots a}_{\text{n a's}})^*$ in the below discussion.
 The exponential size is triggered by that the regex
-$\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$
+$\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$
 inside the $(\ldots) ^*$ having exponentially many
 different derivatives, despite those difference being minor.
-$(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*\backslash \underbrace{a \ldots a}_{\text{m a's}}$
+$(\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*\backslash \underbrace{a \ldots a}_{\text{m a's}}$
 will therefore contain the following terms (after flattening out all nested 
 alternatives):
 \begin{center}
-	$(\oplus_{i = 1]{n}  (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot  (\underbrace{a \ldots a}_{\text{i a's}})^* })\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)$\\
+$(\sum_{i = 1}^{n}  (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot  (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)$\\
+[1mm]
 	$(1 \leq m' \leq m )$
 \end{center}
-These terms are distinct for $m' \leq L.C.M.(1, \ldots, n)$ (will be explained in appendix).
+There at at least exponentially
+many such terms.\footnote{To be exact, these terms are 
+distinct for $m' \leq L.C.M.(1, \ldots, n)$, the details are omitted,
+but the point is that the number is exponential.
+} 
 With each new input character taking the derivative against the intermediate result, more and more such distinct
-terms will accumulate, 
-until the length reaches $L.C.M.(1, \ldots, n)$.
-$\textit{distinctBy}$ will not be able to de-duplicate any two of these terms 
-$(\oplus_{i = 1}^{n}  (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot  (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\
-
-$(\oplus_{i = 1}^{n}  (\underbrace{a \ldots a}_{\text{((i - (m'' \% i))\%i) a's}})\cdot  (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\
-where $m' \neq m''$ \\
+terms will accumulate.
+The function $\textit{distinctBy}$ will not be able to de-duplicate any two of these terms 
+\begin{center}
+$(\sum_{i = 1}^{n}  
+(\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot  
+(\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot 
+(\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\
+$(\sum_{i = 1}^{n}  (\underbrace{a \ldots a}_{\text{((i - (m'' \% i))\%i) a's}})\cdot  
+(\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot 
+(\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$
+\end{center}
+\noindent
+where $m' \neq m''$
 as they are slightly different.
 This means that with our current simplification methods,
 we will not be able to control the derivative so that
-$\llbracket \bderssimp{r}{s} \rrbracket$ stays polynomial %\leq O((\llbracket r\rrbacket)^c)$
-as there are already exponentially many terms.
+$\llbracket \bderssimp{r}{s} \rrbracket$ stays polynomial. %\leq O((\llbracket r\rrbacket)^c)$
 These terms are similar in the sense that the head of those terms
 are all consisted of sub-terms of the form: 
 $(\underbrace{a \ldots a}_{\text{j a's}})\cdot  (\underbrace{a \ldots a}_{\text{i a's}})^* $.
-For  $\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$, there will be at most
+For  $\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$, there will be at most
 $n * (n + 1) / 2$ such terms. 
 For example, $(a^* + (aa)^* + (aaa)^*) ^*$'s derivatives
 can be described by 6 terms:
@@ -2495,15 +2835,29 @@
 $aa \cdot (aaa)^*$, $a \cdot (aaa)^*$, and $(aaa)^*$.
 The total number of different "head terms",  $n * (n + 1) / 2$,
 is proportional to the number of characters in the regex 
-$(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$.
-This suggests a slightly different notion of size, which we call the 
+$(\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$.
+If we can improve our deduplication process so that it becomes smarter
+and only keep track of these $n * (n+1) /2$ terms, then we can keep
+the size growth polynomial again.
+This example also suggests a slightly different notion of size, which we call the 
 alphabetic width:
-%TODO:
-(TODO: Alphabetic width def.)
+\begin{center}
+	\begin{tabular}{lcl}
+		$\textit{awidth} \; \ZERO$ & $\dn$ & $0$\\
+		$\textit{awidth} \; \ONE$ & $\dn$ & $0$\\
+		$\textit{awidth} \; c$ & $\dn$ & $1$\\
+		$\textit{awidth} \; r_1 + r_2$ & $\dn$ & $\textit{awidth} \; 
+		r_1 + \textit{awidth} \; r_2$\\
+		$\textit{awidth} \; r_1 \cdot r_2$ & $\dn$ & $\textit{awidth} \;
+		r_1 + \textit{awidth} \; r_2$\\
+		$\textit{awidth} \; r^*$ & $\dn$ & $\textit{awidth} \; r$\\
+	\end{tabular}
+\end{center}
+
 
 
 Antimirov\parencite{Antimirov95} has proven that 
-$\textit{PDER}_{UNIV}(r) \leq \textit{awidth}(r)$.
+$\textit{PDER}_{UNIV}(r) \leq \textit{awidth}(r)$,
 where $\textit{PDER}_{UNIV}(r)$ is a set of all possible subterms
 created by doing derivatives of $r$ against all possible strings.
 If we can make sure that at any moment in our lexing algorithm our 
@@ -2523,12 +2877,12 @@
 %-----------------------------------
 %	SUBSECTION 1
 %-----------------------------------
-\subsection{Syntactic Equivalence Under $\simp$}
-We prove that minor differences can be annhilated
-by $\simp$.
-For example,
-\begin{center}
-	$\simp \;(\simpALTs\; (\map \;(\_\backslash \; x)\; (\distinct \; \mathit{rs}\; \phi))) = 
-	\simp \;(\simpALTs \;(\distinct \;(\map \;(\_ \backslash\; x) \; \mathit{rs}) \; \phi))$
-\end{center}
+%\subsection{Syntactic Equivalence Under $\simp$}
+%We prove that minor differences can be annhilated
+%by $\simp$.
+%For example,
+%\begin{center}
+%	$\simp \;(\simpALTs\; (\map \;(\_\backslash \; x)\; (\distinct \; \mathit{rs}\; \phi))) = 
+%	\simp \;(\simpALTs \;(\distinct \;(\map \;(\_ \backslash\; x) \; \mathit{rs}) \; \phi))$
+%\end{center}

--- a/ChengsongTanPhdThesis/Chapters/Introduction.tex	Thu Oct 13 23:55:25 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Introduction.tex	Sun Nov 06 23:06:10 2022 +0000
@@ -780,13 +780,13 @@
 %	\end{tabular}
 %\end{center}
 A concrete example
-for back-references would be
+for back-references is
 \begin{center}
 $(.^*)\backslash 1$,
 \end{center}
-which would match 
+which matches
 strings that can be split into two identical halves,
-for example $\mathit{foofoo}$, $\mathit{ww}$ and etc.
+for example $\mathit{foofoo}$, $\mathit{ww}$ and so on.
 Note that this is different from 
 repeating the  sub-expression verbatim like
 \begin{center}
@@ -795,32 +795,32 @@
 which does not impose any restrictions on what strings the second 
 sub-expression $.^*$
 might match.
-Another example of back-references would be
+Another example of back-references is
 \begin{center}
 $(.)(.)\backslash 2\backslash 1$
 \end{center}
-which expresses four-character palindromes
-like $abba$, $x??x$ etc.
+which matches four-character palindromes
+like $abba$, $x??x$ and so on.
 
 Back-references is a regex construct 
-that programmers found quite useful.
+that programmers find quite useful.
 According to Becchi and Crawley\cite{Becchi08},
-6\% of Snort rules (up until 2008) include the use of them.
+6\% of Snort rules (up until 2008) use them.
 The most common use of back-references
-would be expressing well-formed html files,
-where back-references would be handy in expressing
-a pair of opening and closing tags like 
+is to express well-formed html files,
+where back-references are convenient for matching
+opening and closing tags like 
 \begin{center}
 	$\langle html \rangle \ldots \langle / html \rangle$
 \end{center}
 A regex describing such a format
-could be
+is
 \begin{center}
 	$\langle (.^+) \rangle \ldots \langle / \backslash 1 \rangle$
 \end{center}
-Despite being useful, the syntax and expressive power of regexes 
+Despite being useful, the expressive power of regexes 
 go beyond the regular language hierarchy
-with back-references.
+once back-references are included.
 In fact, they allow the regex construct to express 
 languages that cannot be contained in context-free
 languages either.
@@ -829,17 +829,15 @@
 which cannot be expressed by context-free grammars\parencite{campeanu2003formal}.
 Such a language is contained in the context-sensitive hierarchy
 of formal languages. 
-Solving the back-reference expressions matching problem
+Also solving the matching problem involving back-references
 is known to be NP-complete \parencite{alfred2014algorithms}.
-A non-bactracking,
-efficient solution is not known to exist.
 Regex libraries supporting back-references such as 
 PCRE \cite{pcre} therefore have to
 revert to a depth-first search algorithm which backtracks.
 What is unexpected is that even in the cases 
 not involving back-references, there is still
 a (non-negligible) chance they might backtrack super-linearly,
-as shown in the graphs in \ref{fig:aStarStarb}.
+as shown in the graphs in figure\ref{fig:aStarStarb}.
 
 \subsection{Summary of the Catastrophic Backtracking Problem}
 Summing these up, we can categorise existing

--- a/ChengsongTanPhdThesis/example.bib	Thu Oct 13 23:55:25 2022 +0100
+++ b/ChengsongTanPhdThesis/example.bib	Sun Nov 06 23:06:10 2022 +0000
@@ -76,7 +76,16 @@
 
 
 %% Brzozowski ders------------------------
-
+@article{Murugesan2014,
+  author    = {N.~Murugesan and O.~V.~Shanmuga Sundaram},
+  title     = {{S}ome {P}roperties of {B}rzozowski {D}erivatives of {R}egular {E}xpressions},
+  journal   = {International Journal of Computer Trends and Technology},
+  volume    = {13},
+  number    = {1},
+  year      = {2014},
+  url       = {http://arxiv.org/abs/1407.5902},
+  pages     = {29--33}
+}
 
 %% look-aheads------------------------
 @article{Takayuki2019,

--- a/thys4/posix/FBound.thy	Thu Oct 13 23:55:25 2022 +0100
+++ b/thys4/posix/FBound.thy	Sun Nov 06 23:06:10 2022 +0000
@@ -640,7 +640,7 @@
   sorry
 
 
-lemma compl_rrewrite_down1:
+lemma compl_rrewrite_down1_1:
   shows "\<lbrakk>r1 \<le>1 r2; s_complexity r1 = s_complexity r2 \<rbrakk> \<Longrightarrow> r1 = r2"
   using compl_rrewrite_down nat_less_le by auto

author	Chengsong
	Sun, 06 Nov 2022 23:06:10 +0000
changeset 619	2072a8d54e3e
parent 618	233cf2b97d1a (diff)
parent 617	3ea6a38c33b5 (current diff)
child 620	ae6010c14e49