diff -r 80cc6dc4c98b -r bd1354127574 ChengsongTanPhdThesis/Chapters/Finite.tex --- a/ChengsongTanPhdThesis/Chapters/Finite.tex Fri Dec 30 17:37:51 2022 +0000 +++ b/ChengsongTanPhdThesis/Chapters/Finite.tex Fri Dec 30 23:41:44 2022 +0000 @@ -8,12 +8,12 @@ %regex's derivatives. -In this chapter we give a bound in terms of size of +In this chapter we give a bound in terms of the size of the calculated derivatives: given an annotated regular expression $a$, for any string $s$ our algorithm $\blexersimp$'s derivatives are finitely bounded -by a constant that only depends on $a$. +by a constant that only depends on $a$. Formally we show that there exists an $N_a$ such that \begin{center} $\llbracket \bderssimp{a}{s} \rrbracket \leq N_a$ @@ -22,9 +22,9 @@ where the size ($\llbracket \_ \rrbracket$) of an annotated regular expression is defined in terms of the number of nodes in its -tree structure (its recursive definition given in the next page). +tree structure (its recursive definition is given in the next page). We believe this size bound -is important in the context of POSIX lexing, because +is important in the context of POSIX lexing because \begin{itemize} \item It is a stepping stone towards the goal @@ -37,8 +37,8 @@ is not just finite but polynomial in $\llbracket a\rrbracket$. \item Having the finite bound formalised - gives us a higher confidence that - our simplification algorithm $\simp$ does not ``mis-behave'' + gives us higher confidence that + our simplification algorithm $\simp$ does not ``misbehave'' like $\textit{simpSL}$ does. The bound is universal for a given regular expression, which is an advantage over work which @@ -121,7 +121,7 @@ stay below a size bound $N_a$ determined by the input $a$. \noindent -Sulzmann and Lu's assumed a similar picture about their algorithm, +Sulzmann and Lu's assumed a similar picture of their algorithm, though in fact their algorithm's size might be better depicted by the following graph: \begin{figure}[H] \begin{tikzpicture}[scale=2, @@ -162,11 +162,11 @@ They tested out the run time of their lexer on particular examples such as $(a+b+ab)^*$ and claimed that their algorithm is linear w.r.t to the input. -With our mecahnised proof, we avoid this type of unintentional +With our mechanised proof, we avoid this type of unintentional generalisation. -Before delving in to the details of the formalisation, -we are going to provide an overview of it in the next subsection. +Before delving into the details of the formalisation, +we are going to provide an overview of it in the following subsection. \subsection{Overview of the Proof} @@ -236,7 +236,7 @@ allowing a more convenient size bound proof. %Of course, the bits which encode the lexing information %would grow linearly with respect -%to the input, which should be taken into account when we wish to tackle the runtime comlexity. +%to the input, which should be taken into accounte when we wish to tackle the runtime complexity. %But for the sake of the structural size %we can safely ignore them.\\ The datatype @@ -269,10 +269,7 @@ \noindent The $r$ in the subscript of $\llbracket \rrbracket_r$ is to differentiate with the same operation for annotated regular expressions. -Adding $r$ as subscript will be used in -other operations as well.\\ -The transformation from an annotated regular expression -to an r-regular expression is straightforward. +Similar subscripts will be added for operations like $\rerase{}$: \begin{center} \begin{tabular}{lcl} $\rerase{\ZERO}$ & $\dn$ & $\RZERO$\\ @@ -314,7 +311,7 @@ The annotated regular expression $\sum[a, b, c]$ would turn into $(a+(b+c))$. All these operations change the size and structure of -an annotated regular expressions, adding unnecessary +an annotated regular expression, adding unnecessary complexities to the size bound proof. For example, if we define the size of a basic plain regular expression @@ -522,9 +519,9 @@ %\begin{itemize} % \item % First, we rewrite $r\backslash s$ into something else that is easier -% to bound. This step is especially important for the inductive case +% to bound. This step is crucial for the inductive case % $r_1 \cdot r_2$ and $r^*$, where the derivative can grow and bloat in a wild way, -% but after simplification they will always be equal or smaller to a form consisting of an alternative +% but after simplification, they will always be equal or smaller to a form consisting of an alternative % list of regular expressions $f \; (g\; (\sum rs))$ with some functions applied to it, where each element will be distinct after the function application. % \item % Then, for such a sum list of regular expressions $f\; (g\; (\sum rs))$, we can control its size @@ -580,7 +577,7 @@ in the simplification, which prevents the $rs$ from growing indefinitely. Based on this idea, we develop a proof in two steps. -First, we show the equality (where +First, we show the below equality (where $f$ and $g$ are functions that do not increase the size of the input) \begin{center} $r\backslash_{rsimps} s = f\; (\textit{rdistinct} \; (g\; \sum rs))$, @@ -604,7 +601,7 @@ \section{Closed Forms} In this section we introduce in detail -how we express the string derivatives +how to express the string derivatives of regular expressions (i.e. $r \backslash_r s$ where $s$ is a string rather than a single character) in a different way than our previous definition. @@ -623,16 +620,16 @@ We call such more concrete representations the ``closed forms'' of string derivatives as opposed to their original definitions. The terminology ``closed form'' is borrowed from mathematics, -which usually describe expressions that are solely comprised of +which usually describe expressions that are solely comprised of finitely many well-known and easy-to-compute operations such as -additions, multiplications, exponential functions. +additions, multiplications, and exponential functions. We start by proving some basic identities involving the simplification functions for r-regular expressions. After that we introduce the rewrite relations $\rightsquigarrow_h$, $\rightsquigarrow^*_{scf}$ $\rightsquigarrow_f$ and $\rightsquigarrow_g$. -These relations involves similar techniques as in chapter \ref{Bitcoded2} +These relations involve similar techniques as in chapter \ref{Bitcoded2} for annotated regular expressions. Finally, we use these identities to establish the closed forms of the alternative regular expression, @@ -645,14 +642,14 @@ In what follows we will often convert between lists and sets. -We use Isabelle's $set$ to refere to the +We use Isabelle's $set$ to refer to the function that converts a list $rs$ to the set containing all the elements in $rs$. \subsubsection{$\textit{rdistinct}$'s Does the Job of De-duplication} The $\textit{rdistinct}$ function, as its name suggests, will de-duplicate an r-regular expression list. It will also remove any elements that -is already in the accumulator set. +are already in the accumulator set. \begin{lemma}\label{rdistinctDoesTheJob} %The function $\textit{rdistinct}$ satisfies the following %properties: @@ -675,7 +672,7 @@ \noindent \begin{proof} The first part is by an induction on $rs$. - The second and third part can be proven by using the + The second and third parts can be proven by using the inductive cases of $\textit{rdistinct}$. \end{proof} @@ -804,16 +801,16 @@ We give in this subsection some properties involving $\backslash_r$, $\backslash_{rsimps}$, $\textit{rflts}$ and $\textit{rsimp}_{ALTS} $, together with any non-trivial lemmas that lead to them. -These will be helpful in later closed form proofs, when +These will be helpful in later closed-form proofs, when we want to transform derivative terms which have %the ways in which multiple functions involving %those are composed together -interleaving derivatives and simplifications applied to them. +interleaving derivatives and simplifications applied to them. \noindent %When the function $\textit{Rflts}$ -%is applied to the concatenation of two lists, the output can be calculated by first applying the -%functions on two lists separately, and then concatenating them together. +%is applied to the concatenation of two lists; the output can be calculated by first applying the +%functions on two lists separately and then concatenating them together. $\textit{Rflts}$ is composable in terms of concatenation: \begin{lemma}\label{rfltsProps} The function $\rflts$ has the properties below:\\ @@ -919,10 +916,10 @@ which evaluates to true when the regular expression is not an alternative, and false otherwise. The $\good$ property is preserved under $\rsimp_{ALTS}$, provided that -its non-empty argument list of expressions are all good themsleves, and $\textit{nonAlt}$, +its non-empty argument list of expressions are all good themselves, and $\textit{nonAlt}$, and unique: \begin{lemma}\label{rsimpaltsGood} - If $rs \neq []$ and forall $r \in rs. \textit{nonAlt} \; r$ and $\textit{isDistinct} \; rs$, + If $rs \neq []$ and for all $r \in rs. \textit{nonAlt} \; r$ and $\textit{isDistinct} \; rs$, then $\good \; (\rsimpalts \; rs)$ if and only if forall $r \in rs. \; \good \; r$. \end{lemma} \noindent @@ -962,7 +959,7 @@ By induction on $r$. \end{proof} \noindent -With this we can prove that a regular expressions +With this we can prove that a regular expression after simplification and flattening and de-duplication, will not contain any alternative regular expression directly: \begin{lemma}\label{nonaltFltsRd} @@ -995,8 +992,8 @@ \end{itemize} \end{lemma} \begin{proof} - The first part is by induction on $rs$, where the induction - rule is the inductive cases for $\rflts$. + The first part is by induction, where the inductive cases + are the inductive cases of $\rflts$. The second part is a corollary from the first part. \end{proof} @@ -1012,7 +1009,7 @@ $\llbracket \rsimp{r}\rrbracket_r$ is smaller than or equal to $\llbracket r \rrbracket_r$. Therefore, in the $r_1 \cdot r_2$ and $\sum rs$ case, - Inductive hypothesis applies to the children regular expressions + The inductive hypothesis applies to the children regular expressions $r_1$, $r_2$, etc. The lemma \ref{flts3Obv}'s precondition is satisfied by that as well. The lemmas \ref{nonnestedRsimp} and \ref{nonaltFltsRd} are used @@ -1101,7 +1098,7 @@ \end{itemize} \end{corollary} \noindent -This will be useful in the later closed form proof's rewriting steps. +This will be useful in the later closed-form proof's rewriting steps. Similarly, we state the following useful facts below: \begin{lemma} The following equalities hold if $r = \rsimp{r'}$ for some $r'$: @@ -1442,7 +1439,7 @@ If $r \hrewrites r' $ then $\rder{c}{r} \hrewrites \rder{c}{r'}$ \end{lemma} \noindent -This allows us proving more $\mathbf{rsimp}$-equivalent terms, involving $\backslash_r$. +This allows us to prove more $\mathbf{rsimp}$-equivalent terms, involving $\backslash_r$. \begin{lemma}\label{insideSimpRemoval} $\rsimp{(\rder{c}{(\rsimp{r})})} = \rsimp{(\rder{c}{r})} $ \end{lemma} @@ -1530,7 +1527,7 @@ $ \longrightarrow_{\backslash c''} \quad \ldots$\\ \end{tabular} \end{center} -Roughly speaking $r_1 \cdot r_2 \backslash s$ can be expresssed as +Roughly speaking $r_1 \cdot r_2 \backslash s$ can be expressed as a giant alternative taking a list of terms $[r_1 \backslash_r s \cdot r_2, r_2 \backslash_r s'', r_2 \backslash_r s_1'', \ldots]$, where the head of the list is always the term @@ -1600,13 +1597,13 @@ and only generate the $r_2 \backslash_r s''$ terms satisfying the property \begin{center} -$\exists s'. such \; that \; s'@s'' = s \;\; \land \;\; +$\exists s'. such \; that \; s'@s'' = s \;\; \land \;\; r_1 \backslash s' \; is \; nullable$. \end{center} Given the arguments $s$ and $r_1$, we denote the list of strings $s''$ satisfying the above property as $\vsuf{s}{r_1}$. The function $\vsuf{\_}{\_}$ is defined recursively on the structure of the string\footnote{ - Perhaps a better name of it would be ``NullablePrefixSuffix'' + Perhaps a better name for it would be ``NullablePrefixSuffix'' to differentiate with the list of \emph{all} prefixes of $s$, but that is a bit too long for a function name and we are yet to find a more concise and easy-to-understand name.} @@ -1835,7 +1832,7 @@ the number of terms in $r^* \backslash s$ will grow exponentially rather than linearly in the sequence case. The good news is that the function $\textit{rsimp}$ will again -ignore the difference between differently nesting patterns of alternatives, +ignore the difference between different nesting patterns of alternatives, and the exponentially growing star derivative like \begin{center} $(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + @@ -2020,7 +2017,7 @@ holds. \end{lemma} \begin{proof} - By an induction on the inductivev cases of $\textit{createdByStar}$. + By an induction on the inductive cases of $\textit{createdByStar}$. \end{proof} %This is not entirely true for annotated regular expressions: %%TODO: bsimp bders \neq bderssimp @@ -2090,7 +2087,7 @@ \begin{proof} By using the rewriting relation $\rightsquigarrow$ \end{proof} -And from this we obtain that a +And from this we obtain the following fact: a regular expression created by star is the same as its flattened version, up to equivalence under $\textit{bsimp}$. For example, @@ -2238,17 +2235,17 @@ In this section, we introduce how we formalised the bound on closed forms. -We first show that in general regular expressions up to a certain -size are finite. +We first show that in general the number of regular expressions up to a certain +size is finite. Then we prove that functions such as $\rflts$ will not cause the size of r-regular expressions to grow. Putting this together with a general bound on the finiteness of distinct regular expressions -up to a certain size, we obtain a bound on +up to a specific size, we obtain a bound on the closed forms. \subsection{Finiteness of Distinct Regular Expressions} -We define the set of regular expressions whose size are no more than +We define the set of regular expressions whose size is no more than a certain size $N$ as $\textit{sizeNregex} \; N$: \[ \textit{sizeNregex} \; N \dn \{r\; \mid \; \llbracket r \rrbracket_r \leq N \} @@ -2261,7 +2258,7 @@ By splitting the set $\textit{sizeNregex} \; (N + 1)$ into subsets by their categories: $\{\ZERO_r, \ONE_r, c\}$, $\{r^* \mid r \in \textit{sizeNregex} \; N\}$, - and so on. Each of these subsets are finitely bounded. + and so on. Each of these subsets is finitely bounded. \end{proof} \noindent From this we get a corollary that @@ -2329,7 +2326,7 @@ \end{itemize} \end{lemma} \begin{proof} - Point 1, 3, 4 can be proven by an induction on $rs$. + Points 1, 3, and 4 can be proven by an induction on $rs$. Point 2 is by case analysis on $r_1$ and $r_2$. The last part is a corollary of the previous ones. \end{proof} @@ -2374,8 +2371,8 @@ will be solely comprised of $r_1 \backslash s'$ and $r_2 \backslash s''$, $s'$ and $s''$ being sub-strings of $s$). -which will each have a size uppder bound -according to inductive hypothesis, which controls $r \backslash s$. +which will each have a size upper bound +according to the inductive hypothesis, which controls $r \backslash s$. We elaborate the above reasoning by a series of lemmas below, where straightforward proofs are omitted. @@ -2391,7 +2388,7 @@ We need this so that we know the outcome of our real simplification is better than or equal to a rough estimate, and therefore can be bounded by that estimate. -This is a bit harder to establish compared with proving +This is a bit harder to establish compared to proving $\textit{flts}$ does not make a list larger (which can be proven using routine induction): \begin{center} @@ -2486,7 +2483,7 @@ By using corollary \ref{interactionFltsDB}. \end{proof} \noindent -This is a key lemma in establishing the bounds on all the +This is a key lemma in establishing the bounds of all the closed forms. With this we are now ready to control the sizes of $(r_1 \cdot r_2 )\backslash s$ and $r^* \backslash s$. @@ -2641,7 +2638,7 @@ \end{center} \noindent Arguably we should use $\log \; n$ for the size because -the number of digits increase logarithmically w.r.t $n$. +the number of digits increases logarithmically w.r.t $n$. For simplicity we choose to add the counter directly to the size. The derivative w.r.t a bounded regular expression @@ -2693,7 +2690,7 @@ The rewrite rules such as $\rightsquigarrow$, $\stackrel{s}{\rightsquigarrow}$ and so on do not need to be changed, and only a few lemmas such as lemma \ref{fltsPreserves} need to be adjusted to -add one more line which can be solved by sledgehammer +add one more line which can be solved by the Sledgehammer tool to solve the $r^{\{n\}}$ inductive case. @@ -2918,7 +2915,7 @@ then $\cbn \; (r\backslash_r c)$ holds. \end{proof} \noindent -In addition, for $\cbn$-shaped regular expressioins, one can flatten +In addition, for $\cbn$-shaped regular expressions, one can flatten them: \begin{lemma}\label{ntimesHfauPushin} If $\cbn \; r$ holds, then $\hflataux{r \backslash_r c} = @@ -2981,7 +2978,7 @@ \end{center} \begin{lemma}\label{nupdatesNonempty} If for all elements $o \in \textit{set} \; Ss$, - $\nString \; o$ holds, the we have that + $\nString \; o$ holds, then we have that for all elements $o' \in \textit{set} \; (\nupdates \; s \; r \; Ss)$, $\nString \; o'$ holds. \end{lemma} @@ -3109,8 +3106,8 @@ for constructs like $r^{\{\ldots n\}}$, $r^{\{n \ldots\}}$ and so on, which is still work in progress. They should more or less follow the same recipe described in this section. -Once we know about how to deal with them recursively using suitable auxiliary -definitions, we are able to routinely establish the proofs. +Once we know how to deal with them recursively using suitable auxiliary +definitions, we can routinely establish the proofs. %---------------------------------------------------------------------------------------- @@ -3160,7 +3157,7 @@ This means the bound we have will surge up at least tower-exponentially with a linear increase of the depth. -One might be quite skeptical about what this non-elementary +One might be pretty skepticafl about what this non-elementary bound can bring us. It turns out that the giant bounds are far from being hit. Here we have some test data from randomly generated regular expressions: @@ -3232,7 +3229,7 @@ \end{tabular} \caption{Graphs: size change of 3 randomly generated regular expressions $w.r.t.$ input string length. - The x axis represents the length of input.} + The x-axis represents the length of the input.} \end{figure} \noindent Most of the regex's sizes seem to stay within a polynomial bound $w.r.t$ the @@ -3259,8 +3256,8 @@ The size proof can serve as a starting point for a complexity formalisation. Derivatives are the most important phases of our lexer algorithm. - Size properties about derivatives covers the majority of the algorithm - and is therefore a good indication of complexity of the entire program. + Size properties about derivatives cover the majority of the algorithm + and is therefore a good indication of the complexity of the entire program. \item The bound is already a strong indication that catastrophic backtracking is much less likely to occur in our $\blexersimp$ @@ -3280,8 +3277,8 @@ -One might wonder the actual bound rather than the loose bound we gave -for the convenience of an easier proof. +One might wonder about the actual bound rather than the loose bound we gave +for the convenience of a more straightforward proof. How much can the regex $r^* \backslash s$ grow? As earlier graphs have shown, %TODO: reference that graph where size grows quickly @@ -3322,7 +3319,7 @@ The exponential size is triggered by that the regex $\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$ inside the $(\ldots) ^*$ having exponentially many -different derivatives, despite those difference being minor. +different derivatives, despite those differences being minor. $(\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*\backslash \underbrace{a \ldots a}_{\text{m a's}}$ will therefore contain the following terms (after flattening out all nested alternatives): @@ -3408,7 +3405,7 @@ % SUBSECTION 1 %----------------------------------- %\subsection{Syntactic Equivalence Under $\simp$} -%We prove that minor differences can be annhilated +%We prove that minor differences can be annihilated %by $\simp$. %For example, %\begin{center}