--- a/ChengsongTanPhdThesis/Chapters/Finite.tex Fri Dec 30 17:37:51 2022 +0000
+++ b/ChengsongTanPhdThesis/Chapters/Finite.tex Fri Dec 30 23:41:44 2022 +0000
@@ -8,12 +8,12 @@
%regex's derivatives.
-In this chapter we give a bound in terms of size of
+In this chapter we give a bound in terms of the size of
the calculated derivatives:
given an annotated regular expression $a$, for any string $s$
our algorithm $\blexersimp$'s derivatives
are finitely bounded
-by a constant that only depends on $a$.
+by a constant that only depends on $a$.
Formally we show that there exists an $N_a$ such that
\begin{center}
$\llbracket \bderssimp{a}{s} \rrbracket \leq N_a$
@@ -22,9 +22,9 @@
where the size ($\llbracket \_ \rrbracket$) of
an annotated regular expression is defined
in terms of the number of nodes in its
-tree structure (its recursive definition given in the next page).
+tree structure (its recursive definition is given in the next page).
We believe this size bound
-is important in the context of POSIX lexing, because
+is important in the context of POSIX lexing because
\begin{itemize}
\item
It is a stepping stone towards the goal
@@ -37,8 +37,8 @@
is not just finite but polynomial in $\llbracket a\rrbracket$.
\item
Having the finite bound formalised
- gives us a higher confidence that
- our simplification algorithm $\simp$ does not ``mis-behave''
+ gives us higher confidence that
+ our simplification algorithm $\simp$ does not ``misbehave''
like $\textit{simpSL}$ does.
The bound is universal for a given regular expression,
which is an advantage over work which
@@ -121,7 +121,7 @@
stay below a size bound $N_a$ determined by the input $a$.
\noindent
-Sulzmann and Lu's assumed a similar picture about their algorithm,
+Sulzmann and Lu's assumed a similar picture of their algorithm,
though in fact their algorithm's size might be better depicted by the following graph:
\begin{figure}[H]
\begin{tikzpicture}[scale=2,
@@ -162,11 +162,11 @@
They tested out the run time of their
lexer on particular examples such as $(a+b+ab)^*$
and claimed that their algorithm is linear w.r.t to the input.
-With our mecahnised proof, we avoid this type of unintentional
+With our mechanised proof, we avoid this type of unintentional
generalisation.
-Before delving in to the details of the formalisation,
-we are going to provide an overview of it in the next subsection.
+Before delving into the details of the formalisation,
+we are going to provide an overview of it in the following subsection.
\subsection{Overview of the Proof}
@@ -236,7 +236,7 @@
allowing a more convenient size bound proof.
%Of course, the bits which encode the lexing information
%would grow linearly with respect
-%to the input, which should be taken into account when we wish to tackle the runtime comlexity.
+%to the input, which should be taken into accounte when we wish to tackle the runtime complexity.
%But for the sake of the structural size
%we can safely ignore them.\\
The datatype
@@ -269,10 +269,7 @@
\noindent
The $r$ in the subscript of $\llbracket \rrbracket_r$ is to
differentiate with the same operation for annotated regular expressions.
-Adding $r$ as subscript will be used in
-other operations as well.\\
-The transformation from an annotated regular expression
-to an r-regular expression is straightforward.
+Similar subscripts will be added for operations like $\rerase{}$:
\begin{center}
\begin{tabular}{lcl}
$\rerase{\ZERO}$ & $\dn$ & $\RZERO$\\
@@ -314,7 +311,7 @@
The annotated regular expression $\sum[a, b, c]$ would turn into
$(a+(b+c))$.
All these operations change the size and structure of
-an annotated regular expressions, adding unnecessary
+an annotated regular expression, adding unnecessary
complexities to the size bound proof.
For example, if we define the size of a basic plain regular expression
@@ -522,9 +519,9 @@
%\begin{itemize}
% \item
% First, we rewrite $r\backslash s$ into something else that is easier
-% to bound. This step is especially important for the inductive case
+% to bound. This step is crucial for the inductive case
% $r_1 \cdot r_2$ and $r^*$, where the derivative can grow and bloat in a wild way,
-% but after simplification they will always be equal or smaller to a form consisting of an alternative
+% but after simplification, they will always be equal or smaller to a form consisting of an alternative
% list of regular expressions $f \; (g\; (\sum rs))$ with some functions applied to it, where each element will be distinct after the function application.
% \item
% Then, for such a sum list of regular expressions $f\; (g\; (\sum rs))$, we can control its size
@@ -580,7 +577,7 @@
in the simplification, which prevents the $rs$ from growing indefinitely.
Based on this idea, we develop a proof in two steps.
-First, we show the equality (where
+First, we show the below equality (where
$f$ and $g$ are functions that do not increase the size of the input)
\begin{center}
$r\backslash_{rsimps} s = f\; (\textit{rdistinct} \; (g\; \sum rs))$,
@@ -604,7 +601,7 @@
\section{Closed Forms}
In this section we introduce in detail
-how we express the string derivatives
+how to express the string derivatives
of regular expressions (i.e. $r \backslash_r s$ where $s$ is a string
rather than a single character) in a different way than
our previous definition.
@@ -623,16 +620,16 @@
We call such more concrete representations the ``closed forms'' of
string derivatives as opposed to their original definitions.
The terminology ``closed form'' is borrowed from mathematics,
-which usually describe expressions that are solely comprised of
+which usually describe expressions that are solely comprised of finitely many
well-known and easy-to-compute operations such as
-additions, multiplications, exponential functions.
+additions, multiplications, and exponential functions.
We start by proving some basic identities
involving the simplification functions for r-regular expressions.
After that we introduce the rewrite relations
$\rightsquigarrow_h$, $\rightsquigarrow^*_{scf}$
$\rightsquigarrow_f$ and $\rightsquigarrow_g$.
-These relations involves similar techniques as in chapter \ref{Bitcoded2}
+These relations involve similar techniques as in chapter \ref{Bitcoded2}
for annotated regular expressions.
Finally, we use these identities to establish the
closed forms of the alternative regular expression,
@@ -645,14 +642,14 @@
In what follows we will often convert between lists
and sets.
-We use Isabelle's $set$ to refere to the
+We use Isabelle's $set$ to refer to the
function that converts a list $rs$ to the set
containing all the elements in $rs$.
\subsubsection{$\textit{rdistinct}$'s Does the Job of De-duplication}
The $\textit{rdistinct}$ function, as its name suggests, will
de-duplicate an r-regular expression list.
It will also remove any elements that
-is already in the accumulator set.
+are already in the accumulator set.
\begin{lemma}\label{rdistinctDoesTheJob}
%The function $\textit{rdistinct}$ satisfies the following
%properties:
@@ -675,7 +672,7 @@
\noindent
\begin{proof}
The first part is by an induction on $rs$.
- The second and third part can be proven by using the
+ The second and third parts can be proven by using the
inductive cases of $\textit{rdistinct}$.
\end{proof}
@@ -804,16 +801,16 @@
We give in this subsection some properties
involving $\backslash_r$, $\backslash_{rsimps}$, $\textit{rflts}$ and
$\textit{rsimp}_{ALTS} $, together with any non-trivial lemmas that lead to them.
-These will be helpful in later closed form proofs, when
+These will be helpful in later closed-form proofs, when
we want to transform derivative terms which have
%the ways in which multiple functions involving
%those are composed together
-interleaving derivatives and simplifications applied to them.
+interleaving derivatives and simplifications applied to them.
\noindent
%When the function $\textit{Rflts}$
-%is applied to the concatenation of two lists, the output can be calculated by first applying the
-%functions on two lists separately, and then concatenating them together.
+%is applied to the concatenation of two lists; the output can be calculated by first applying the
+%functions on two lists separately and then concatenating them together.
$\textit{Rflts}$ is composable in terms of concatenation:
\begin{lemma}\label{rfltsProps}
The function $\rflts$ has the properties below:\\
@@ -919,10 +916,10 @@
which evaluates to true when the regular expression is not an
alternative, and false otherwise.
The $\good$ property is preserved under $\rsimp_{ALTS}$, provided that
-its non-empty argument list of expressions are all good themsleves, and $\textit{nonAlt}$,
+its non-empty argument list of expressions are all good themselves, and $\textit{nonAlt}$,
and unique:
\begin{lemma}\label{rsimpaltsGood}
- If $rs \neq []$ and forall $r \in rs. \textit{nonAlt} \; r$ and $\textit{isDistinct} \; rs$,
+ If $rs \neq []$ and for all $r \in rs. \textit{nonAlt} \; r$ and $\textit{isDistinct} \; rs$,
then $\good \; (\rsimpalts \; rs)$ if and only if forall $r \in rs. \; \good \; r$.
\end{lemma}
\noindent
@@ -962,7 +959,7 @@
By induction on $r$.
\end{proof}
\noindent
-With this we can prove that a regular expressions
+With this we can prove that a regular expression
after simplification and flattening and de-duplication,
will not contain any alternative regular expression directly:
\begin{lemma}\label{nonaltFltsRd}
@@ -995,8 +992,8 @@
\end{itemize}
\end{lemma}
\begin{proof}
- The first part is by induction on $rs$, where the induction
- rule is the inductive cases for $\rflts$.
+ The first part is by induction, where the inductive cases
+ are the inductive cases of $\rflts$.
The second part is a corollary from the first part.
\end{proof}
@@ -1012,7 +1009,7 @@
$\llbracket \rsimp{r}\rrbracket_r$ is smaller than or equal to
$\llbracket r \rrbracket_r$.
Therefore, in the $r_1 \cdot r_2$ and $\sum rs$ case,
- Inductive hypothesis applies to the children regular expressions
+ The inductive hypothesis applies to the children regular expressions
$r_1$, $r_2$, etc. The lemma \ref{flts3Obv}'s precondition is satisfied
by that as well.
The lemmas \ref{nonnestedRsimp} and \ref{nonaltFltsRd} are used
@@ -1101,7 +1098,7 @@
\end{itemize}
\end{corollary}
\noindent
-This will be useful in the later closed form proof's rewriting steps.
+This will be useful in the later closed-form proof's rewriting steps.
Similarly, we state the following useful facts below:
\begin{lemma}
The following equalities hold if $r = \rsimp{r'}$ for some $r'$:
@@ -1442,7 +1439,7 @@
If $r \hrewrites r' $ then $\rder{c}{r} \hrewrites \rder{c}{r'}$
\end{lemma}
\noindent
-This allows us proving more $\mathbf{rsimp}$-equivalent terms, involving $\backslash_r$.
+This allows us to prove more $\mathbf{rsimp}$-equivalent terms, involving $\backslash_r$.
\begin{lemma}\label{insideSimpRemoval}
$\rsimp{(\rder{c}{(\rsimp{r})})} = \rsimp{(\rder{c}{r})} $
\end{lemma}
@@ -1530,7 +1527,7 @@
$ \longrightarrow_{\backslash c''} \quad \ldots$\\
\end{tabular}
\end{center}
-Roughly speaking $r_1 \cdot r_2 \backslash s$ can be expresssed as
+Roughly speaking $r_1 \cdot r_2 \backslash s$ can be expressed as
a giant alternative taking a list of terms
$[r_1 \backslash_r s \cdot r_2, r_2 \backslash_r s'', r_2 \backslash_r s_1'', \ldots]$,
where the head of the list is always the term
@@ -1600,13 +1597,13 @@
and only generate the
$r_2 \backslash_r s''$ terms satisfying the property
\begin{center}
-$\exists s'. such \; that \; s'@s'' = s \;\; \land \;\;
+$\exists s'. such \; that \; s'@s'' = s \;\; \land \;\;
r_1 \backslash s' \; is \; nullable$.
\end{center}
Given the arguments $s$ and $r_1$, we denote the list of strings
$s''$ satisfying the above property as $\vsuf{s}{r_1}$.
The function $\vsuf{\_}{\_}$ is defined recursively on the structure of the string\footnote{
- Perhaps a better name of it would be ``NullablePrefixSuffix''
+ Perhaps a better name for it would be ``NullablePrefixSuffix''
to differentiate with the list of \emph{all} prefixes of $s$, but
that is a bit too long for a function name and we are yet to find
a more concise and easy-to-understand name.}
@@ -1835,7 +1832,7 @@
the number of terms in $r^* \backslash s$ will grow exponentially rather than linearly
in the sequence case.
The good news is that the function $\textit{rsimp}$ will again
-ignore the difference between differently nesting patterns of alternatives,
+ignore the difference between different nesting patterns of alternatives,
and the exponentially growing star derivative like
\begin{center}
$(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') +
@@ -2020,7 +2017,7 @@
holds.
\end{lemma}
\begin{proof}
- By an induction on the inductivev cases of $\textit{createdByStar}$.
+ By an induction on the inductive cases of $\textit{createdByStar}$.
\end{proof}
%This is not entirely true for annotated regular expressions:
%%TODO: bsimp bders \neq bderssimp
@@ -2090,7 +2087,7 @@
\begin{proof}
By using the rewriting relation $\rightsquigarrow$
\end{proof}
-And from this we obtain that a
+And from this we obtain the following fact: a
regular expression created by star
is the same as its flattened version, up to equivalence under $\textit{bsimp}$.
For example,
@@ -2238,17 +2235,17 @@
In this section, we introduce how we formalised the bound
on closed forms.
-We first show that in general regular expressions up to a certain
-size are finite.
+We first show that in general the number of regular expressions up to a certain
+size is finite.
Then we prove that functions such as $\rflts$
will not cause the size of r-regular expressions to grow.
Putting this together with a general bound
on the finiteness of distinct regular expressions
-up to a certain size, we obtain a bound on
+up to a specific size, we obtain a bound on
the closed forms.
\subsection{Finiteness of Distinct Regular Expressions}
-We define the set of regular expressions whose size are no more than
+We define the set of regular expressions whose size is no more than
a certain size $N$ as $\textit{sizeNregex} \; N$:
\[
\textit{sizeNregex} \; N \dn \{r\; \mid \; \llbracket r \rrbracket_r \leq N \}
@@ -2261,7 +2258,7 @@
By splitting the set $\textit{sizeNregex} \; (N + 1)$ into
subsets by their categories:
$\{\ZERO_r, \ONE_r, c\}$, $\{r^* \mid r \in \textit{sizeNregex} \; N\}$,
- and so on. Each of these subsets are finitely bounded.
+ and so on. Each of these subsets is finitely bounded.
\end{proof}
\noindent
From this we get a corollary that
@@ -2329,7 +2326,7 @@
\end{itemize}
\end{lemma}
\begin{proof}
- Point 1, 3, 4 can be proven by an induction on $rs$.
+ Points 1, 3, and 4 can be proven by an induction on $rs$.
Point 2 is by case analysis on $r_1$ and $r_2$.
The last part is a corollary of the previous ones.
\end{proof}
@@ -2374,8 +2371,8 @@
will be solely comprised of $r_1 \backslash s'$
and $r_2 \backslash s''$, $s'$ and $s''$ being
sub-strings of $s$).
-which will each have a size uppder bound
-according to inductive hypothesis, which controls $r \backslash s$.
+which will each have a size upper bound
+according to the inductive hypothesis, which controls $r \backslash s$.
We elaborate the above reasoning by a series of lemmas
below, where straightforward proofs are omitted.
@@ -2391,7 +2388,7 @@
We need this so that we know the outcome of our real
simplification is better than or equal to a rough estimate,
and therefore can be bounded by that estimate.
-This is a bit harder to establish compared with proving
+This is a bit harder to establish compared to proving
$\textit{flts}$ does not make a list larger (which can
be proven using routine induction):
\begin{center}
@@ -2486,7 +2483,7 @@
By using corollary \ref{interactionFltsDB}.
\end{proof}
\noindent
-This is a key lemma in establishing the bounds on all the
+This is a key lemma in establishing the bounds of all the
closed forms.
With this we are now ready to control the sizes of
$(r_1 \cdot r_2 )\backslash s$ and $r^* \backslash s$.
@@ -2641,7 +2638,7 @@
\end{center}
\noindent
Arguably we should use $\log \; n$ for the size because
-the number of digits increase logarithmically w.r.t $n$.
+the number of digits increases logarithmically w.r.t $n$.
For simplicity we choose to add the counter directly to the size.
The derivative w.r.t a bounded regular expression
@@ -2693,7 +2690,7 @@
The rewrite rules such as $\rightsquigarrow$, $\stackrel{s}{\rightsquigarrow}$ and so on
do not need to be changed,
and only a few lemmas such as lemma \ref{fltsPreserves} need to be adjusted to
-add one more line which can be solved by sledgehammer
+add one more line which can be solved by the Sledgehammer tool
to solve the $r^{\{n\}}$ inductive case.
@@ -2918,7 +2915,7 @@
then $\cbn \; (r\backslash_r c)$ holds.
\end{proof}
\noindent
-In addition, for $\cbn$-shaped regular expressioins, one can flatten
+In addition, for $\cbn$-shaped regular expressions, one can flatten
them:
\begin{lemma}\label{ntimesHfauPushin}
If $\cbn \; r$ holds, then $\hflataux{r \backslash_r c} =
@@ -2981,7 +2978,7 @@
\end{center}
\begin{lemma}\label{nupdatesNonempty}
If for all elements $o \in \textit{set} \; Ss$,
- $\nString \; o$ holds, the we have that
+ $\nString \; o$ holds, then we have that
for all elements $o' \in \textit{set} \; (\nupdates \; s \; r \; Ss)$,
$\nString \; o'$ holds.
\end{lemma}
@@ -3109,8 +3106,8 @@
for constructs like $r^{\{\ldots n\}}$, $r^{\{n \ldots\}}$
and so on, which is still work in progress.
They should more or less follow the same recipe described in this section.
-Once we know about how to deal with them recursively using suitable auxiliary
-definitions, we are able to routinely establish the proofs.
+Once we know how to deal with them recursively using suitable auxiliary
+definitions, we can routinely establish the proofs.
%----------------------------------------------------------------------------------------
@@ -3160,7 +3157,7 @@
This means the bound we have will surge up at least
tower-exponentially with a linear increase of the depth.
-One might be quite skeptical about what this non-elementary
+One might be pretty skepticafl about what this non-elementary
bound can bring us.
It turns out that the giant bounds are far from being hit.
Here we have some test data from randomly generated regular expressions:
@@ -3232,7 +3229,7 @@
\end{tabular}
\caption{Graphs: size change of 3 randomly generated
regular expressions $w.r.t.$ input string length.
- The x axis represents the length of input.}
+ The x-axis represents the length of the input.}
\end{figure}
\noindent
Most of the regex's sizes seem to stay within a polynomial bound $w.r.t$ the
@@ -3259,8 +3256,8 @@
The size proof can serve as a starting point for a complexity
formalisation.
Derivatives are the most important phases of our lexer algorithm.
- Size properties about derivatives covers the majority of the algorithm
- and is therefore a good indication of complexity of the entire program.
+ Size properties about derivatives cover the majority of the algorithm
+ and is therefore a good indication of the complexity of the entire program.
\item
The bound is already a strong indication that catastrophic
backtracking is much less likely to occur in our $\blexersimp$
@@ -3280,8 +3277,8 @@
-One might wonder the actual bound rather than the loose bound we gave
-for the convenience of an easier proof.
+One might wonder about the actual bound rather than the loose bound we gave
+for the convenience of a more straightforward proof.
How much can the regex $r^* \backslash s$ grow?
As earlier graphs have shown,
%TODO: reference that graph where size grows quickly
@@ -3322,7 +3319,7 @@
The exponential size is triggered by that the regex
$\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$
inside the $(\ldots) ^*$ having exponentially many
-different derivatives, despite those difference being minor.
+different derivatives, despite those differences being minor.
$(\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*\backslash \underbrace{a \ldots a}_{\text{m a's}}$
will therefore contain the following terms (after flattening out all nested
alternatives):
@@ -3408,7 +3405,7 @@
% SUBSECTION 1
%-----------------------------------
%\subsection{Syntactic Equivalence Under $\simp$}
-%We prove that minor differences can be annhilated
+%We prove that minor differences can be annihilated
%by $\simp$.
%For example,
%\begin{center}