ChengsongTanPhdThesis/Chapters/Finite.tex
changeset 640 bd1354127574
parent 639 80cc6dc4c98b
child 659 2e05f04ed6b3
--- a/ChengsongTanPhdThesis/Chapters/Finite.tex	Fri Dec 30 17:37:51 2022 +0000
+++ b/ChengsongTanPhdThesis/Chapters/Finite.tex	Fri Dec 30 23:41:44 2022 +0000
@@ -8,12 +8,12 @@
 %regex's derivatives. 
 
 
-In this chapter we give a bound in terms of size of 
+In this chapter we give a bound in terms of the size of 
 the calculated derivatives: 
 given an annotated regular expression $a$, for any string $s$
 our algorithm $\blexersimp$'s derivatives
 are finitely bounded
-by a constant  that only depends on $a$.
+by a constant that only depends on $a$.
 Formally we show that there exists an $N_a$ such that
 \begin{center}
 	$\llbracket \bderssimp{a}{s} \rrbracket \leq N_a$
@@ -22,9 +22,9 @@
 where the size ($\llbracket \_ \rrbracket$) of 
 an annotated regular expression is defined
 in terms of the number of nodes in its 
-tree structure (its recursive definition given in the next page).
+tree structure (its recursive definition is given in the next page).
 We believe this size bound
-is important in the context of POSIX lexing, because 
+is important in the context of POSIX lexing because 
 \begin{itemize}
 	\item
 		It is a stepping stone towards the goal 
@@ -37,8 +37,8 @@
 		is not just finite but polynomial in $\llbracket a\rrbracket$.
 	\item
 		Having the finite bound formalised 
-		gives us a higher confidence that
-		our simplification algorithm $\simp$ does not ``mis-behave''
+		gives us higher confidence that
+		our simplification algorithm $\simp$ does not ``misbehave''
 		like $\textit{simpSL}$ does.
 		The bound is universal for a given regular expression, 
 		which is an advantage over work which 
@@ -121,7 +121,7 @@
 stay below a size bound $N_a$ determined by the input $a$.
 
 \noindent
-Sulzmann and Lu's assumed a similar picture about their algorithm,
+Sulzmann and Lu's assumed a similar picture of their algorithm,
 though in fact their algorithm's size might be better depicted by the following graph:
 \begin{figure}[H]
 	\begin{tikzpicture}[scale=2,
@@ -162,11 +162,11 @@
 They tested out the run time of their
 lexer on particular examples such as $(a+b+ab)^*$
 and claimed that their algorithm is linear w.r.t to the input.
-With our mecahnised proof, we avoid this type of unintentional
+With our mechanised proof, we avoid this type of unintentional
 generalisation.
 
-Before delving in to the details of the formalisation,
-we are going to provide an overview of it in the next subsection.
+Before delving into the details of the formalisation,
+we are going to provide an overview of it in the following subsection.
 
 
 \subsection{Overview of the Proof}
@@ -236,7 +236,7 @@
 allowing a more convenient size bound proof.
 %Of course, the bits which encode the lexing information 
 %would grow linearly with respect 
-%to the input, which should be taken into account when we wish to tackle the runtime comlexity.
+%to the input, which should be taken into accounte when we wish to tackle the runtime complexity.
 %But for the sake of the structural size 
 %we can safely ignore them.\\
 The datatype 
@@ -269,10 +269,7 @@
 \noindent
 The $r$ in the subscript of $\llbracket \rrbracket_r$ is to 
 differentiate with the same operation for annotated regular expressions.
-Adding $r$ as subscript will be used in 
-other operations as well.\\
-The transformation from an annotated regular expression
-to an r-regular expression is straightforward.
+Similar subscripts will be added for operations like $\rerase{}$:
 \begin{center}
 	\begin{tabular}{lcl}
 		$\rerase{\ZERO}$ & $\dn$ & $\RZERO$\\
@@ -314,7 +311,7 @@
 The  annotated regular expression $\sum[a, b, c]$ would turn into
 $(a+(b+c))$.
 All these operations change the size and structure of
-an annotated regular expressions, adding unnecessary 
+an annotated regular expression, adding unnecessary 
 complexities to the size bound proof.
 
 For example, if we define the size of a basic plain regular expression 
@@ -522,9 +519,9 @@
 %\begin{itemize}
 %	\item
 %		First, we rewrite $r\backslash s$ into something else that is easier
-%		to bound. This step is especially important for the inductive case 
+%		to bound. This step is crucial for the inductive case 
 %		$r_1 \cdot r_2$ and $r^*$, where the derivative can grow and bloat in a wild way,
-%		but after simplification they will always be equal or smaller to a form consisting of an alternative
+%		but after simplification, they will always be equal or smaller to a form consisting of an alternative
 %		list of regular expressions $f \; (g\; (\sum rs))$ with some functions applied to it, where each element will be distinct after the function application.
 %	\item
 %		Then, for such a sum  list of regular expressions $f\; (g\; (\sum rs))$, we can control its size
@@ -580,7 +577,7 @@
 in the simplification, which prevents the $rs$ from growing indefinitely.
 
 Based on this idea, we develop a proof in two steps.
-First, we show the equality (where
+First, we show the below equality (where
 $f$ and $g$ are functions that do not increase the size of the input)
 \begin{center}
 $r\backslash_{rsimps} s = f\; (\textit{rdistinct} \; (g\; \sum rs))$,
@@ -604,7 +601,7 @@
 
 \section{Closed Forms}
 In this section we introduce in detail
-how we express the string derivatives
+how to express the string derivatives
 of regular expressions (i.e. $r \backslash_r s$ where $s$ is a string
 rather than a single character) in a different way than 
 our previous definition.
@@ -623,16 +620,16 @@
 We call such more concrete representations the ``closed forms'' of
 string derivatives as opposed to their original definitions.
 The terminology ``closed form'' is borrowed from mathematics,
-which usually describe expressions that are solely comprised of
+which usually describe expressions that are solely comprised of finitely many
 well-known and easy-to-compute operations such as 
-additions, multiplications, exponential functions.
+additions, multiplications, and exponential functions.
 
 We start by proving some basic identities
 involving the simplification functions for r-regular expressions.
 After that we introduce the rewrite relations
 $\rightsquigarrow_h$, $\rightsquigarrow^*_{scf}$
 $\rightsquigarrow_f$ and $\rightsquigarrow_g$.
-These relations involves similar techniques as in chapter \ref{Bitcoded2}
+These relations involve similar techniques as in chapter \ref{Bitcoded2}
 for annotated regular expressions.
 Finally, we use these identities to establish the
 closed forms of the alternative regular expression,
@@ -645,14 +642,14 @@
 
 In what follows we will often convert between lists
 and sets.
-We use Isabelle's $set$ to refere to the 
+We use Isabelle's $set$ to refer to the 
 function that converts a list $rs$ to the set
 containing all the elements in $rs$.
 \subsubsection{$\textit{rdistinct}$'s Does the Job of De-duplication}
 The $\textit{rdistinct}$ function, as its name suggests, will
 de-duplicate an r-regular expression list.
 It will also remove any elements that 
-is already in the accumulator set.
+are already in the accumulator set.
 \begin{lemma}\label{rdistinctDoesTheJob}
 	%The function $\textit{rdistinct}$ satisfies the following
 	%properties:
@@ -675,7 +672,7 @@
 \noindent
 \begin{proof}
 	The first part is by an induction on $rs$.
-	The second and third part can be proven by using the 
+	The second and third parts can be proven by using the 
 	inductive cases of $\textit{rdistinct}$.
 
 \end{proof}
@@ -804,16 +801,16 @@
 We give in this subsection some properties
 involving $\backslash_r$, $\backslash_{rsimps}$, $\textit{rflts}$ and 
 $\textit{rsimp}_{ALTS} $, together with any non-trivial lemmas that lead to them.
-These will be helpful in later closed form proofs, when
+These will be helpful in later closed-form proofs, when
 we want to transform derivative terms which have
 %the ways in which multiple functions involving
 %those are composed together
-interleaving derivatives and  simplifications applied to them.
+interleaving derivatives and simplifications applied to them.
 
 \noindent
 %When the function $\textit{Rflts}$ 
-%is applied to the concatenation of two lists, the output can be calculated by first applying the
-%functions on two lists separately, and then concatenating them together.
+%is applied to the concatenation of two lists; the output can be calculated by first applying the
+%functions on two lists separately and then concatenating them together.
 $\textit{Rflts}$ is composable in terms of concatenation:
 \begin{lemma}\label{rfltsProps}
 	The function $\rflts$ has the properties below:\\
@@ -919,10 +916,10 @@
 which evaluates to true when the regular expression is not an
 alternative, and false otherwise.
 The $\good$ property is preserved under $\rsimp_{ALTS}$, provided that
-its non-empty argument list of expressions are all good themsleves, and $\textit{nonAlt}$, 
+its non-empty argument list of expressions are all good themselves, and $\textit{nonAlt}$, 
 and unique:
 \begin{lemma}\label{rsimpaltsGood}
-	If $rs \neq []$ and forall $r \in rs. \textit{nonAlt} \; r$ and $\textit{isDistinct} \; rs$,
+	If $rs \neq []$ and for all $r \in rs. \textit{nonAlt} \; r$ and $\textit{isDistinct} \; rs$,
 	then $\good \; (\rsimpalts \; rs)$ if and only if forall $r \in rs. \; \good \; r$.
 \end{lemma}
 \noindent
@@ -962,7 +959,7 @@
 	By induction on $r$.
 \end{proof}
 \noindent
-With this we can prove that a regular expressions
+With this we can prove that a regular expression
 after simplification and flattening and de-duplication,
 will not contain any alternative regular expression directly:
 \begin{lemma}\label{nonaltFltsRd}
@@ -995,8 +992,8 @@
 	\end{itemize}
 \end{lemma}
 \begin{proof}
-	The first part is by induction on $rs$, where the induction
-	rule is the inductive cases for $\rflts$.
+	The first part is by induction, where the inductive cases
+	are the inductive cases of $\rflts$.
 	The second part is a corollary from the first part.
 \end{proof}
 
@@ -1012,7 +1009,7 @@
 	$\llbracket \rsimp{r}\rrbracket_r$ is smaller than or equal to
 	$\llbracket r \rrbracket_r$.
 	Therefore, in the $r_1 \cdot r_2$ and $\sum rs$ case,
-	Inductive hypothesis applies to the children regular expressions
+	The inductive hypothesis applies to the children regular expressions
 	$r_1$, $r_2$, etc. The lemma \ref{flts3Obv}'s precondition is satisfied
 	by that as well.
 	The lemmas \ref{nonnestedRsimp} and  \ref{nonaltFltsRd} are used
@@ -1101,7 +1098,7 @@
 	\end{itemize}
 \end{corollary}
 \noindent
-This will be useful in the later closed form proof's rewriting steps.
+This will be useful in the later closed-form proof's rewriting steps.
 Similarly, we state the following useful facts below:
 \begin{lemma}
 	The following equalities hold if $r = \rsimp{r'}$ for some $r'$:
@@ -1442,7 +1439,7 @@
 	If $r \hrewrites r' $ then $\rder{c}{r} \hrewrites \rder{c}{r'}$
 \end{lemma}
 \noindent
-This allows us proving more $\mathbf{rsimp}$-equivalent terms, involving $\backslash_r$.
+This allows us to prove more $\mathbf{rsimp}$-equivalent terms, involving $\backslash_r$.
 \begin{lemma}\label{insideSimpRemoval}
 	$\rsimp{(\rder{c}{(\rsimp{r})})} = \rsimp{(\rder{c}{r})}  $
 \end{lemma}
@@ -1530,7 +1527,7 @@
 		$   \longrightarrow_{\backslash c''} \quad \ldots$\\
 	\end{tabular}
 \end{center}
-Roughly speaking $r_1 \cdot r_2 \backslash s$ can be expresssed as 
+Roughly speaking $r_1 \cdot r_2 \backslash s$ can be expressed as 
 a giant alternative taking a list of terms 
 $[r_1 \backslash_r s \cdot r_2, r_2 \backslash_r s'', r_2 \backslash_r s_1'', \ldots]$,
 where the head of the list is always the term
@@ -1600,13 +1597,13 @@
 and only generate the 
 $r_2 \backslash_r s''$ terms satisfying the property
 \begin{center}
-$\exists  s'.  such \; that \; s'@s'' = s \;\; \land \;\;
+$\exists s'.  such \; that \; s'@s'' = s \;\; \land \;\;
 r_1 \backslash s' \; is \; nullable$.
 \end{center}
 Given the arguments $s$ and $r_1$, we denote the list of strings
 $s''$ satisfying the above property as $\vsuf{s}{r_1}$.
 The function $\vsuf{\_}{\_}$ is defined recursively on the structure of the string\footnote{
-	Perhaps a better name of it would be ``NullablePrefixSuffix''
+	Perhaps a better name for it would be ``NullablePrefixSuffix''
 	to differentiate with the list of \emph{all} prefixes of $s$, but
 	that is a bit too long for a function name and we are yet to find
 a more concise and easy-to-understand name.}
@@ -1835,7 +1832,7 @@
 the number of terms in $r^* \backslash s$ will grow exponentially rather than linearly
 in the sequence case.
 The good news is that the function $\textit{rsimp}$ will again
-ignore the difference between differently nesting patterns of alternatives,
+ignore the difference between different nesting patterns of alternatives,
 and the exponentially growing star derivative like
 \begin{center}
 	$(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + 
@@ -2020,7 +2017,7 @@
 	holds.
 \end{lemma}
 \begin{proof}
-	By an induction on the inductivev cases of $\textit{createdByStar}$.
+	By an induction on the inductive cases of $\textit{createdByStar}$.
 \end{proof}
 %This is not entirely true for annotated regular expressions: 
 %%TODO: bsimp bders \neq bderssimp
@@ -2090,7 +2087,7 @@
 \begin{proof}
 	By using the rewriting relation $\rightsquigarrow$
 \end{proof}
-And from this we obtain that a 
+And from this we obtain the following fact: a 
 regular expression created by star 
 is the same as its flattened version, up to equivalence under $\textit{bsimp}$.
 For example,
@@ -2238,17 +2235,17 @@
 
 In this section, we introduce how we formalised the bound
 on closed forms.
-We first show that in general regular expressions up to a certain 
-size are finite.
+We first show that in general the number of regular expressions up to a certain 
+size is finite.
 Then we prove that functions such as $\rflts$
 will not cause the size of r-regular expressions to grow.
 Putting this together with a general bound 
 on the finiteness of distinct regular expressions
-up to a certain size, we obtain a bound on 
+up to a specific size, we obtain a bound on 
 the closed forms.
 
 \subsection{Finiteness of Distinct Regular Expressions}
-We define the set of regular expressions whose size are no more than
+We define the set of regular expressions whose size is no more than
 a certain size $N$ as $\textit{sizeNregex} \; N$:
 \[
 	\textit{sizeNregex} \; N \dn \{r\; \mid \;  \llbracket r \rrbracket_r \leq N \}
@@ -2261,7 +2258,7 @@
 	By splitting the set $\textit{sizeNregex} \; (N + 1)$ into
 	subsets by their categories:
 	$\{\ZERO_r, \ONE_r, c\}$, $\{r^* \mid r \in \textit{sizeNregex} \; N\}$,
-	and so on. Each of these subsets are finitely bounded.
+	and so on. Each of these subsets is finitely bounded.
 \end{proof}
 \noindent
 From this we get a corollary that
@@ -2329,7 +2326,7 @@
 	\end{itemize}
 \end{lemma}
 \begin{proof}
-	Point 1, 3, 4 can be proven by an induction on $rs$.
+	Points 1, 3, and 4 can be proven by an induction on $rs$.
 	Point 2 is by case analysis on $r_1$ and $r_2$.
 	The last part is a corollary of the previous ones.
 \end{proof}
@@ -2374,8 +2371,8 @@
 will be solely comprised of $r_1 \backslash s'$ 
 and $r_2 \backslash s''$, $s'$ and $s''$ being 
 sub-strings of $s$).
-which will each have a size uppder bound 
-according to inductive hypothesis, which controls $r \backslash s$.
+which will each have a size upper bound 
+according to the inductive hypothesis, which controls $r \backslash s$.
 
 We elaborate the above reasoning by a series of lemmas
 below, where straightforward proofs are omitted.
@@ -2391,7 +2388,7 @@
 We need this so that we know the outcome of our real 
 simplification is better than or equal to a rough estimate,
 and therefore can be bounded by that estimate.
-This is a bit harder to establish compared with proving
+This is a bit harder to establish compared to proving
 $\textit{flts}$ does not make a list larger (which can
 be proven using routine induction):
 \begin{center}
@@ -2486,7 +2483,7 @@
 	By using corollary \ref{interactionFltsDB}.
 \end{proof}
 \noindent
-This is a key lemma in establishing the bounds on all the 
+This is a key lemma in establishing the bounds of all the 
 closed forms.
 With this we are now ready to control the sizes of
 $(r_1 \cdot r_2 )\backslash s$ and $r^* \backslash s$.
@@ -2641,7 +2638,7 @@
 \end{center}
 \noindent
 Arguably we should use $\log \; n$ for the size because
-the number of digits increase logarithmically w.r.t $n$.
+the number of digits increases logarithmically w.r.t $n$.
 For simplicity we choose to add the counter directly to the size.
 
 The derivative w.r.t a bounded regular expression
@@ -2693,7 +2690,7 @@
 The rewrite rules such as $\rightsquigarrow$, $\stackrel{s}{\rightsquigarrow}$ and so on
 do not need to be changed,
 and only a few lemmas such as lemma \ref{fltsPreserves} need to be adjusted to 
-add one more line which can be solved by sledgehammer 
+add one more line which can be solved by the Sledgehammer tool
 to solve the $r^{\{n\}}$ inductive case.
 
 
@@ -2918,7 +2915,7 @@
 	then $\cbn \; (r\backslash_r c)$ holds.
 \end{proof}
 \noindent
-In addition, for $\cbn$-shaped regular expressioins, one can flatten
+In addition, for $\cbn$-shaped regular expressions, one can flatten
 them:
 \begin{lemma}\label{ntimesHfauPushin}
 	If $\cbn \; r$ holds, then $\hflataux{r \backslash_r c} = 
@@ -2981,7 +2978,7 @@
 \end{center}
 \begin{lemma}\label{nupdatesNonempty}
 	If for all elements $o \in \textit{set} \; Ss$,
-	$\nString \; o$ holds, the we have that
+	$\nString \; o$ holds, then we have that
 	for all elements $o' \in \textit{set} \; (\nupdates \; s \; r \; Ss)$,
 	$\nString \; o'$ holds.
 \end{lemma}
@@ -3109,8 +3106,8 @@
 for constructs like $r^{\{\ldots n\}}$, $r^{\{n \ldots\}}$
 and so on, which is still work in progress.
 They should more or less follow the same recipe described in this section.
-Once we know about how to deal with them recursively using suitable auxiliary
-definitions, we are able to routinely establish the proofs.
+Once we know how to deal with them recursively using suitable auxiliary
+definitions, we can routinely establish the proofs.
 
 
 %----------------------------------------------------------------------------------------
@@ -3160,7 +3157,7 @@
 This means the bound we have will surge up at least
 tower-exponentially with a linear increase of the depth.
 
-One might be quite skeptical about what this non-elementary
+One might be pretty skepticafl about what this non-elementary
 bound can bring us.
 It turns out that the giant bounds are far from being hit.
 Here we have some test data from randomly generated regular expressions:
@@ -3232,7 +3229,7 @@
 	\end{tabular}    
   \caption{Graphs: size change of 3 randomly generated 
   regular expressions $w.r.t.$ input string length. 
-  The x axis represents the length of input.}
+  The x-axis represents the length of the input.}
 \end{figure}  
 \noindent
 Most of the regex's sizes seem to stay within a polynomial bound $w.r.t$ the 
@@ -3259,8 +3256,8 @@
 		The size proof can serve as a starting point for a complexity
 		formalisation.
 		Derivatives are the most important phases of our lexer algorithm.
-		Size properties about derivatives covers the majority of the algorithm
-		and is therefore a good indication of complexity of the entire program.
+		Size properties about derivatives cover the majority of the algorithm
+		and is therefore a good indication of the complexity of the entire program.
 	\item
 		The bound is already a strong indication that catastrophic
 		backtracking is much less likely to occur in our $\blexersimp$
@@ -3280,8 +3277,8 @@
 
 
 
-One might wonder the actual bound rather than the loose bound we gave
-for the convenience of an easier proof.
+One might wonder about the actual bound rather than the loose bound we gave
+for the convenience of a more straightforward proof.
 How much can the regex $r^* \backslash s$ grow? 
 As  earlier graphs have shown,
 %TODO: reference that graph where size grows quickly
@@ -3322,7 +3319,7 @@
 The exponential size is triggered by that the regex
 $\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$
 inside the $(\ldots) ^*$ having exponentially many
-different derivatives, despite those difference being minor.
+different derivatives, despite those differences being minor.
 $(\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*\backslash \underbrace{a \ldots a}_{\text{m a's}}$
 will therefore contain the following terms (after flattening out all nested 
 alternatives):
@@ -3408,7 +3405,7 @@
 %	SUBSECTION 1
 %-----------------------------------
 %\subsection{Syntactic Equivalence Under $\simp$}
-%We prove that minor differences can be annhilated
+%We prove that minor differences can be annihilated
 %by $\simp$.
 %For example,
 %\begin{center}