lexing: comparison ChengsongTanPhdThesis/Chapters/Finite.tex

equal deleted inserted replaced

-:dd9dde2d902b
+:80cc6dc4c98b
 		is not just finite but polynomial in $\llbracket a\rrbracket$.
 	\item
 		Having the finite bound formalised
 		gives us a higher confidence that
 		our simplification algorithm $\simp$ does not ``mis-behave''
-		like $\simpsulz$ does.
+		like $\textit{simpSL}$ does.
 		The bound is universal for a given regular expression,
 		which is an advantage over work which
 		only gives empirical evidence on
-		some test cases (see Verbatim++ work\cite{Verbatimpp}).
+		some test cases (see for example Verbatim work \cite{Verbatimpp}).
 \end{itemize}
 \noindent
 We then extend our $\blexersimp$
 to support bounded repetitions ($r^{\{n\}}$).
 We update our formalisation of
 the correctness and finiteness properties to
 include this new construct.
-we can out-compete other verified lexers such as
+We show that we can out-compete other verified lexers such as
 Verbatim++ on bounded regular expressions.
 In the next section we describe in more detail
 what the finite bound means in our algorithm
 and why the size of the internal data structures of
 a typical derivative-based lexer such as
-Sulzmann and Lu's need formal treatment.
+Sulzmann and Lu's needs formal treatment.
-\section{Formalising About Size}
+\section{Formalising Size Bound of Derivatives}
 \noindent
 In our lexer ($\blexersimp$),
 we take an annotated regular expression as input,
 and repeately take derivative of and simplify it.
 \begin{figure}
 \end{figure}
 \noindent
 Each time
 a derivative is taken, the regular expression might grow.
-However, the simplification that is immediately afterwards will always shrink it so that
+However, the simplification that is immediately afterwards will often shrink it so that
-it stays relatively small.
+the overall size of the derivatives stays relatively small.
 This intuition is depicted by the relative size
 change between the black and blue nodes:
-After $\simp$ the node always shrinks.
+After $\simp$ the node shrinks.
 Our proof states that all the blue nodes
 stay below a size bound $N_a$ determined by the input $a$.
 \noindent
 Sulzmann and Lu's assumed a similar picture about their algorithm,
 	\end{tikzpicture}
 	\caption{Regular expression size change during our $\blexersimp$ algorithm}\label{sulzShrinks}
 \end{figure}
 \noindent
-The picture means that on certain cases their lexer (where they use $\simpsulz$
+The picture means that in some cases their lexer (where they use $\simpsulz$
 as the simplification function)
 will have a size explosion, causing the running time
 of each derivative step to grow continuously (for example
 in \ref{SulzmannLuLexerTime}).
 They tested out the run time of their
 lexer on particular examples such as $(a+b+ab)^*$
 and claimed that their algorithm is linear w.r.t to the input.
 With our mecahnised proof, we avoid this type of unintentional
-generalisation.\\
+generalisation.
 Before delving in to the details of the formalisation,
-we are going to provide an overview of it.
+we are going to provide an overview of it in the next subsection.
-In the next subsection, we give a high-level view
-of the proof.
 \subsection{Overview of the Proof}
 A high-level overview of the main components of the finiteness proof
 is as follows:
 \section{The $\textit{Rrexp}$ Datatype}
 The first step is to define
 $\textit{rrexp}$s.
 They are annotated regular expressions without bitcodes,
-allowing a much simpler size bound proof.
+allowing a more convenient size bound proof.
 %Of course, the bits which encode the lexing information
 %would grow linearly with respect
 %to the input, which should be taken into account when we wish to tackle the runtime comlexity.
 %But for the sake of the structural size
 %we can safely ignore them.\\
 \emph{r-regular expressions},
 was initially defined in \ref{rrexpDef}.
 The reason for the prefix $r$ is
 to make a distinction
 with basic regular expressions.
+We give here again the definition of $\rrexp$.
 \[			\rrexp ::=   \RZERO \mid  \RONE
 	\mid  \RCHAR{c}
 	\mid  \RSEQ{r_1}{r_2}
 	\mid  \RALTS{rs}
 	\mid \RSTAR{r}
 		$\erase \; _{bs}\sum [a]$ & $\dn$ & $a$\\
 		$\erase \; _{bs}\sum a :: as$ & $\dn$ & $a + (\erase \; _{[]} \sum as)\quad \text{if $as$ length over 1}$
 	\end{tabular}
 \end{center}
 \noindent
-As can be seen alternative regular expression with an empty argument list
+As can be seen, alternative regular expressions with an empty argument list
 will be turned into a $\ZERO$.
 The singleton alternative $\sum [r]$ becomes $r$ during the
 $\erase$ function.
 The  annotated regular expression $\sum[a, b, c]$ would turn into
 $(a+(b+c))$.
 All these operations change the size and structure of
 an annotated regular expressions, adding unnecessary
-complexities to the size bound proof.\\
+complexities to the size bound proof.
 For example, if we define the size of a basic plain regular expression
 in the usual way,
 \begin{center}
 	\begin{tabular}{lcl}
 		$\llbracket \ONE \rrbracket_p$ & $\dn$ & $1$\\
 $\llbracket a \rrbracket  \leq \llbracket  a_\downarrow \rrbracket_p $
 and then estimate $\llbracket  a_\downarrow \rrbracket_p$,
 but we found our approach more straightforward.\\
 \subsection{Functions for R-regular Expressions}
+The downside of our approach is that we need to redefine
+several functions for $\rrexp$.
 In this section we shall define the r-regular expression version
-of $\blexer$, and $\blexersimp$ related functions.
+of $\bder$, and $\textit{bsimp}$ related functions.
 We use $r$ as the prefix or subscript to differentiate
 with the bitcoded version.
 %For example,$\backslash_r$, $\rdistincts$, and $\rsimp$
 %as opposed to $\backslash$, $\distinctBy$, and $\bsimp$.
 %As promised, they are much simpler than their bitcoded counterparts.
 		(_{[]}r^*))$
 	\end{tabular}
 \end{center}
 \noindent
 where we omit the definition of $\textit{rnullable}$.
-The generalisation from derivative w.r.t to character to
+The generalisation from the derivatives w.r.t a character to
-derivative w.r.t string is given as
+derivatives w.r.t strings is given as
 \begin{center}
 	\begin{tabular}{lcl}
 		$r \backslash_{rs} []$ & $\dn$ & $r$\\
 		$r \backslash_{rs} c::s$ & $\dn$ & $(r\backslash_r c) \backslash_{rs} s$
 	\end{tabular}
 \end{center}
 where $r = r_1 \cdot r_2$ or $r = r_0^*$ and so on.
 For example, for $r_1 \cdot r_2$ we have the equality as
 	\begin{center}
 		$ \rderssimp{r_1 \cdot r_2}{s} =
-		\rsimp{(\sum (r_1 \backslash s \cdot r_2 ) \; :: \;(\map \; \rderssimp{r2}{\_} \;(\vsuf{s}{r_1})))}$
+		\rsimp{(\sum (r_1 \backslash s \cdot r_2 ) \; :: \;(\map \; \rderssimp{r_2}{\_} \;(\vsuf{s}{r_1})))}$
 	\end{center}
 We call the right-hand-side the
 \emph{Closed Form} of $(r_1 \cdot r_2)\backslash_{rsimps} s$.
 Second, we will bound the closed form of r-regular expressions
 using some estimation techniques
 		\item
 			%If list $rs'$ is the result of $\rdistinct{rs}{acc}$,
 			$\textit{isDistinct} \;\;\; (\rdistinct{rs}{acc})$.
 		\item
 			$\textit{set} \; (\rdistinct{rs}{acc})
-			= (textit{set} \; rs) - acc$
+			= (\textit{set} \; rs) - acc$
 	\end{itemize}
 \end{lemma}
 \noindent
 \begin{proof}
 	The first part is by an induction on $rs$.
 \end{lemma}
 \begin{proof}
 	By \ref{rdistinctDoesTheJob} and \ref{distinctRemovesMiddle}.
 \end{proof}
 \noindent
-The next lemma is a more general form of \ref{rdistinctConcat},
+The next lemma is a more general form of \ref{rdistinctConcat};
-it says that
+It says that
 $\textit{rdistinct}$ is composable w.r.t list concatenation:
 \begin{lemma}\label{distinctRdistinctAppend}
 			If $\;\; \textit{isDistinct} \; rs_1$,
 			and $(set \; rs_1) \cap acc = \varnothing$,
 			then applying $\textit{rdistinct}$ on $rs_1 @ rs_a$ does not
 \end{proof}
 \noindent
 $\textit{rdistinct}$ needs to be applied only once, and
 applying it multiple times does not make any difference:
 \begin{corollary}\label{distinctOnceEnough}
-	$\textit{rdistinct} \; (rs @ rsa) {} = \textit{rdistinct} \; (rdistinct \;
+	$\textit{rdistinct} \; (rs @ rsa) {} = \textit{rdistinct} \; ( (rdistinct \;
-	rs \{ \} @ (\textit{rdistinct} \; rs_a \; (set \; rs)))$
+	rs \; \{ \}) @ (\textit{rdistinct} \; rs_a \; (set \; rs)))$
 \end{corollary}
 \begin{proof}
 	By lemma \ref{distinctRdistinctAppend}.
 \end{proof}
 	last sub-lemma.
 \end{proof}
 \noindent
 Now we introduce the property that the operations
 derivative and $\rsimpalts$
-commute, this will be used later on, when deriving the closed form for
+commute, this will be used later on when deriving the closed form for
 the alternative regular expression:
 \begin{lemma}\label{rderRsimpAltsCommute}
 	$\rder{x}{(\rsimpalts \; rs)} = \rsimpalts \; (\map \; (\rder{x}{\_}) \; rs)$
 \end{lemma}
 \begin{proof}
 	By induction on $rs$.
 \end{proof}
 \noindent
-\subsubsection{The $RL$ Function: Language Interpretation of $\textit{Rrexp}$s}
+\subsubsection{The $RL$ Function: Language Interpretation for $\textit{Rrexp}$s}
 Much like the definition of $L$ on plain regular expressions, one can also
-define the language interpretation of $\rrexp$s.
+define the language interpretation for $\rrexp$s.
 \begin{center}
 	\begin{tabular}{lcl}
-		$RL \; (\ZERO)$ & $\dn$ & $\phi$\\
+		$RL \; (\ZERO_r)$ & $\dn$ & $\phi$\\
-		$RL \; (\ONE)$ & $\dn$ & $\{[]\}$\\
+		$RL \; (\ONE_r)$ & $\dn$ & $\{[]\}$\\
 		$RL \; (c)$ & $\dn$ & $\{[c]\}$\\
 		$RL \; \sum rs$ & $\dn$ & $ \bigcup_{r \in rs} (RL \; r)$\\
 		$RL \; (r_1 \cdot r_2)$ & $\dn$ & $ RL \; (r_1) @ RL \; (r_2)$\\
 		$RL \; (r^*)$ & $\dn$ & $ (RL(r))^*$
 	\end{tabular}
 \end{lemma}
 \begin{proof}
 	By induction on $r$.
 \end{proof}
 \noindent
-With this we could prove that a regular expressions
+With this we can prove that a regular expressions
 after simplification and flattening and de-duplication,
 will not contain any alternative regular expression directly:
 \begin{lemma}\label{nonaltFltsRd}
 	If $x \in \rdistinct{\rflts\; (\map \; \rsimp{} \; rs)}{\varnothing}$
 	then $\textit{nonAlt} \; x$.
 	The first part is by induction on $rs$, where the induction
 	rule is the inductive cases for $\rflts$.
 	The second part is a corollary from the first part.
 \end{proof}
-And this leads to good structural property of $\rsimp{}$,
+This leads to good structural property of $\rsimp{}$,
 that after simplification, a regular expression is
 either good or $\RZERO$:
 \begin{lemma}\label{good1}
 	For any r-regular expression $r$, $\good \; \rsimp{r}$ or $\rsimp{r} = \RZERO$.
 \end{lemma}
 	by that as well.
 	The lemmas \ref{nonnestedRsimp} and  \ref{nonaltFltsRd} are used
 	to ensure that goodness is preserved at the topmost level.
 \end{proof}
 We shall prove that any good regular expression is
-a fixed-point for $\rsimp{}$.
+a fixed-point for $\textit{rsimp}$.
 First we prove an auxiliary lemma:
 \begin{lemma}\label{goodaltsNonalt}
 	If $\good \; \sum rs$, then $\rflts\; rs = rs$.
 \end{lemma}
 \begin{proof}
 	\ref{goodaltsNonalt} and \ref{rdistinctOnDistinct}.
 	The lemma \ref{goodaltsNonalt} is used in the alternative
 	case where 2 or more elements are present in the list.
 \end{proof}
 \noindent
-Given below is a property involving $\rflts$, $\textit{rdistinct}$,
+Below we show a property involving $\rflts$, $\textit{rdistinct}$,
 $\rsimp{}$ and $\rsimp_{ALTS}$,
 which requires $\ref{good1}$ to go through smoothly:
 \begin{lemma}\label{flattenRsimpalts}
 An application of $\rsimp_{ALTS}$ can be ``absorbed'',
 if its output is concatenated with a list and then applied to $\rflts$.
 \end{proof}
 \noindent
 This property means we do not have to repeatedly
 apply simplification in each step, which justifies
 our definition of $\blexersimp$.
+This is in contrast to the work of Sulzmann and Lu where
+the simplification is applied in a fixpoint manner.
 On the other hand, we can repeat the same $\rsimp{}$ applications
 on regular expressions as many times as we want, if we have at least
-one simplification applied to it, and apply it wherever we would like to:
+one simplification applied to it, and apply it wherever we need to:
 \begin{corollary}\label{headOneMoreSimp}
 	The following properties hold, directly from \ref{rsimpIdem}:
 	\begin{itemize}
 		\item
 \begin{proof}
 	By application of lemmas \ref{rsimpIdem} and \ref{good1}.
 \end{proof}
 \noindent
-With the idempotency of $\rsimp{}$ and its corollaries,
+With the idempotency of $\textit{rsimp}$ and its corollaries,
 we can start proving some key equalities leading to the
 closed forms.
-Now presented are a few equivalent terms under $\rsimp{}$.
+Next we present a few equivalent terms under $\textit{rsimp}$.
 To make the notation more concise
-We use $r_1 \sequal r_2 $ to denote that $\rsimp{r_1} = \rsimp{r_2}$:
+We use $r_1 \sequal r_2 $ to denote that $\rsimp{r_1} = \rsimp{r_2}$.
-\begin{center}
+%\begin{center}
-\begin{tabular}{lcl}
+%\begin{tabular}{lcl}
-	$a \sequal b$ & $ \dn$ & $ \textit{rsimp} \; a = \textit{rsimp} \; b$
+%	$a \sequal b$ & $ \dn$ & $ \textit{rsimp} \; a = \textit{rsimp} \; b$
-\end{tabular}
+%\end{tabular}
-\end{center}
+%\end{center}
-\noindent
+%\noindent
 %\vspace{0em}
 \begin{lemma}
 	The following equivalence hold:
 	\begin{itemize}
 	\item
 	\item
 		$\RALTS{\RALTS{rs}} \sequal \RALTS{rs}$
 	\item
 		$\sum ((\sum rs_a) :: rs_b) \sequal \sum rs_a @ rs_b$
 	\item
-		$\RALTS{rs} = \RALTS{\map \; \rsimp{} \; rs}$
+		$\RALTS{rs} \sequal \RALTS{\map \; \rsimp{} \; rs}$
 \end{itemize}
 \end{lemma}
 \begin{proof}
 	By induction on the lists involved.
 \end{proof}
 \end{figure}
 \noindent
 We define
 two separate list rewriting relations $\frewrite$ and $\grewrite$.
 The rewriting steps that take place during
-flattening is characterised by $\frewrite$.
+flattening are characterised by $\frewrite$.
 The rewrite relation $\grewrite$ characterises both flattening and de-duplicating.
 Sometimes $\grewrites$ is slightly too powerful
 so we would rather use $\frewrites$ to prove
 %because we only
 equalities related to $\rflts$.
 \end{lemma}
 \noindent
 These two lemmas can both be proven using a straightforward induction (and
 the proofs for them are therefore omitted).
-Now the above equalities can be derived like a breeze:
+Now the above equalities can be derived with ease:
 \begin{corollary}
 	$\sum (\rDistinct \;\; (\map \; (\_ \backslash x) \; (\rflts \; rs)) \;\; \varnothing) \sequal
 	\sum (\rDistinct \;\;  (\rflts \; (\map \; (\_ \backslash x) \; rs)) \;\; \varnothing)
 	$
 \end{corollary}
 \subsubsection{Terms That Can Be Rewritten Using $\hrewrites$, $\grewrites$, and $\frewrites$}
 In this part, we present lemmas stating
 pairs of r-regular expressions and r-regular expression lists
-where one could rewrite from one in many steps to the other.
+where one can rewrite from one in many steps to the other.
 Most of the proofs to these lemmas are straightforward, using
 an induction on the corresponding rewriting relations.
 These proofs will therefore be omitted when this is the case.
 We present in the following lemma a few pairs of terms that are rewritable via
 $\grewrites$:
 	\end{itemize}
 \end{lemma}
 \noindent
 Now comes the core of the proof,
 which says that once two lists are rewritable to each other,
-then they are equivalent under $\rsimp{}$:
+then they are equivalent under $\textit{rsimp}$:
 \begin{lemma}
 	If $r_1 \hrewrites r_2$ then $r_1 \sequal r_2$.
 \end{lemma}
 \noindent
-Similar to what we did in \ref{Bitcoded2},
+Similar to what we did in chapter \ref{Bitcoded2},
 we prove that if one can rewrite from one r-regular expression ($r$)
-to the other ($r'$), after taking derivatives one could still rewrite
+to the other ($r'$), after taking derivatives one can still rewrite
 the first ($r\backslash c$) to the other ($r'\backslash c$).
 \begin{lemma}\label{interleave}
 	If $r \hrewrites r' $ then $\rder{c}{r} \hrewrites \rder{c}{r'}$
 \end{lemma}
 \noindent
 \end{center}
 \noindent
 The $\delta$ function
 returns $r$ when the boolean condition
 $b$ evaluates to true and
-$\ZERO$ otherwise:
+$\ZERO_r$ otherwise:
 \begin{center}
 	\begin{tabular}{lcl}
 		$\delta \; b\; r$ & $\dn$ & $r \quad \textit{if} \; b \; is \;\textit{true}$\\
-				  & $\dn$ & $\ZERO \quad otherwise$
+				  & $\dn$ & $\ZERO_r \quad otherwise$
 	\end{tabular}
 \end{center}
 \noindent
 Note that the term
 \begin{center}
 \[
 	r_1 \backslash_r c_1c_2
 \]
 instead of
 \[
-	(r_1 \backslash_r c_1c_2 + \ZERO ) + \ZERO.
+	(r_1 \backslash_r c_1c_2 + \ZERO_r ) + \ZERO_r.
 \]
-The redundant $\ZERO$s will not be created in the
+The redundant $\ZERO_r$s will not be created in the
 first place.
 In a closed-form one needs to take into account this (because
 closed forms require exact equality rather than language equivalence)
 and only generate the
 $r_2 \backslash_r s''$ terms satisfying the property
 is occurring in $(r_1\cdot r_2)\backslash s$.
 With $\textit{Suffix}$ we are ready to express the
 sequence regular expression's closed form,
 but before doing so
-more devices are needed.
+more definitions are needed.
 The first thing is the flattening function $\sflat{\_}$,
 which takes an alternative regular expression and produces a flattened version
 of that alternative regular expression.
 It is needed to convert
 a left-associative nested sequence of alternatives into
 \begin{center}
 	\begin{tabular}{lcl}
 		$\sflataux{[]}'$ & $ \dn $ & $ []$\\
 		$\sflataux{ (r_1 + r_2) :: rs }'$ & $\dn$ & $r_1 :: r_2 :: rs$\\
-		$\sflataux{r :: rs}$ & $\dn$ & $ r::rs$
+		$\sflataux{r :: rs}'$ & $\dn$ & $ r::rs$
 	\end{tabular}
 \end{center}
 \noindent
 $\sflataux{\_}$ breaks up nested alternative regular expressions
 of the $(\ldots((r_1 + r_2) + r_3) + \ldots )$(left-associated) shape
 \begin{proof}
 	By a reverse induction on the string $s$, where the inductive cases are $[]$
 	and $xs  @ [x]$.
 \end{proof}
 \noindent
-If we have a regular expression $r$ whose shpae
+If we have a regular expression $r$ whose shape
 fits into those described by $\textit{createdBySequence}$,
 then we can convert between
 $r \backslash_r c$ and $(\sflataux{r}) \backslash_r c$ with
 $\sflataux{\_}'$:
 \begin{lemma}\label{sfauIdemDer}
 	If $\textit{createdBySequence} \; r$, then
 	\begin{center}
 Now we are ready to express
 the shape of $r_1 \cdot r_2 \backslash s$
 \begin{lemma}\label{seqSfau0}
 	$\sflataux{(r_1 \cdot r_2) \backslash_r s} = (r_1 \backslash_r s) \cdot r_2
-	:: (\map \; (r_2 \backslash_r \_) \; (\textit{Suffix} \; s \; r1))$
+	:: (\map \; (r_2 \backslash_r \_) \; (\textit{Suffix} \; s \; r_1))$
 \end{lemma}
 \begin{proof}
 	By a reverse induction on the string $s$, where the inductive cases
 	are $[]$ and $xs @ [x]$.
 	For the inductive case, we know that $\textit{createdBySequence} \; ((r_1 \cdot r_2)
 %allowing people investigating derivatives to get an alternative
 %view of what $r_1 \cdot r_2$ is.
 We now use $\textit{createdBySequence}$ and
 $\sflataux{\_}$ to describe an intuition
-behind the closed form of the sequence closed form.
+behind the sequence closed form.
 If two regular expressions only differ in the way their
 alternatives are nested, then we should be able to get the same result
 once we apply simplification to both of them:
 \begin{lemma}\label{sflatRsimpeq}
 	If $r$ is created from a sequence through
 	\map\; (r_2 \backslash \_) \; (\vsuf{s}{r_1})))}$
 \end{lemma}
 \begin{proof}
 	We know that
 	$\sflataux{(r_1 \cdot r_2) \backslash_r s} = (r_1 \backslash_r s) \cdot r_2
-	:: (\map \; (r_2 \backslash_r \_) \; (\textit{Suffix} \; s \; r1))$
+	:: (\map \; (r_2 \backslash_r \_) \; (\textit{Suffix} \; s \; r_1))$
 	holds
 	by lemma \ref{seqSfau0}.
 	This allows the theorem to go through because of lemma \ref{sflatRsimpeq}.
 \end{proof}
 Together with the idempotency property of $\rsimp{}$ (lemma \ref{rsimpIdem}),
 	r \backslash c'c'' \cdot r^*, r \backslash c'' \cdot r^*]}$
 \end{center}
 which can be de-duplicated by $\rDistinct$
 and therefore bounded finitely.
-and then de-duplicate terms of the form  ($s'$ being a substring of $s$).
+%and then de-duplicate terms of the form  ($s'$ being a substring of $s$).
-This allows us to use a similar technique as $r_1 \cdot r_2$ case,
+%This allows us to use a similar technique as $r_1 \cdot r_2$ case,
 Now the crux of this section is finding a suitable description
 for $rs$ where
 \begin{center}
 	$\rderssimp{r^*}{s} = \rsimp{\sum rs}$.
 	By an induction on $s$, the inductive cases
 	being $[]$ and $s@[c]$. The lemmas \ref{hfauPushin} and \ref{starDersCbs} are used.
 \end{proof}
 \noindent
-$\hflataux{\_}$ has a similar effect as $\textit{flatten}$:
+The function $\hflataux{\_}$ has a similar effect as $\textit{flatten}$:
 \begin{lemma}\label{hflatauxGrewrites}
 	$a :: rs \grewrites \hflataux{a} @ rs$
 \end{lemma}
 \begin{proof}
 	By induction on $a$. $rs$ is set to take arbitrary values.
 \end{lemma}
 \begin{proof}
 	By using the rewriting relation $\rightsquigarrow$
 \end{proof}
-And from this we obtain a proof that a star's derivative will be the same
+And from this we obtain that a
-as if it had all its nested alternatives created during deriving being flattened out:
+regular expression created by star
+is the same as its flattened version, up to equivalence under $\textit{bsimp}$.
 For example,
 \begin{lemma}\label{hfauRsimpeq2}
-	$\createdByStar{r} \implies \rsimp{r} = \rsimp{\RALTS{\hflataux{r}}}$
+	$\textit{createdByStar} \; r \implies \rsimp{r} = \rsimp{\RALTS{\hflataux{r}}}$
 \end{lemma}
 \begin{proof}
 	By structural induction on $r$, where the induction rules
 	are these of $\createdByStar{\_}$.
 	Lemma \ref{cbsHfauRsimpeq1} is used in the inductive case.
 	$\textit{finite} \; (\textit{sizeNregex} \; N)$ holds.
 \end{lemma}
 \begin{proof}
 	By splitting the set $\textit{sizeNregex} \; (N + 1)$ into
 	subsets by their categories:
-	$\{\ZERO, \ONE, c\}$, $\{* `` \textit{sizeNregex} \; N\}$,
+	$\{\ZERO_r, \ONE_r, c\}$, $\{r^* \mid r \in \textit{sizeNregex} \; N\}$,
 	and so on. Each of these subsets are finitely bounded.
 \end{proof}
 \noindent
 From this we get a corollary that
 if forall $r \in rs$, $\rsize{r} \leq N$, then the output of
 %output of $\textit{rdistinct} \; rs' \; \varnothing$
 %is bounded by a constant $N * c_N$ depending only on $N$,
 %provided that each of $rs'$'s element
 %is bounded by $N$.
-\subsection{$\textit{rsimp}$ Does Not Increment the Size}
+\subsection{$\textit{rsimp}$ Does Not Increase the Size}
 Although it seems evident, we need a series
 of non-trivial lemmas to establish that functions such as $\rflts$
 do not cause the regular expressions to grow.
 \begin{lemma}\label{rsimpMonoLemmas}
 	\mbox{}
 	By \ref{rsimpMonoLemmas}.
 \end{proof}
 \subsection{Estimating the Closed Forms' sizes}
 We recap the closed forms we obtained
-earlier by putting them together down below:
+earlier:
 \begin{itemize}
 	\item
 		$\rderssimp{(\sum rs)}{s} \sequal
 		\sum \; (\map \; (\rderssimp{\_}{s}) \; rs)$
 	\item
 which will each have a size uppder bound
 according to inductive hypothesis, which controls $r \backslash s$.
 We elaborate the above reasoning by a series of lemmas
 below, where straightforward proofs are omitted.
+%We want to apply it to our setting $\rsize{\rsimp{\sum rs}}$.
+We show that $\textit{rdistinct}$ and $\rflts$
-We want to apply it to our setting $\rsize{\rsimp{\sum rs}}$.
-We show that $\rdistinct$ and $\rflts$
 working together is at least as
-good as $\rdistinct{}{}$ alone, which can be written as
+good as $\textit{rdistinct}$ alone, which can be written as
 \begin{center}
 	$\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r
 	\leq
 	\llbracket \rdistinct{rs}{\varnothing}  \rrbracket_r  $.
 \end{center}
 	\begin{center}
 	\begin{tabular}{lcl}
 	$\llbracket  (\textit{rdistinct} \; (\textit{rflts} \; as) \;
 	(noalts\_set \cup corr\_set)) \rrbracket_r$ & $\leq$ &\\
 						    $\llbracket  (\textit{rdistinct} \; as \; (noalts\_set \cup alts\_set \cup
-	\{ \ZERO \} )) \rrbracket_r$ & & \\
+	\{ \ZERO_r \} )) \rrbracket_r$ & & \\
 	\end{tabular}
 	\end{center}
 		holds.
 \end{lemma}
 \noindent
 \end{center}
 \noindent
 (5) is by theorem \ref{starClosedForm}.
 (6) is by \ref{altsSimpControl}.
 (7) is by corollary \ref{finiteSizeNCorollary}.
-Combining with the case when $s = []$, one gets
+Combining with the case when $s = []$, one obtains
 \begin{center}
 	\begin{tabular}{lcll}
 		$\rsize{r^* \backslash_r s}$ & $\leq$ & $max \; n_r \; 1 + (\textit{card} (\sizeNregex \; (N + n_r)))
 		* (1 + (N + n_r)) $ & (8)\\
 	\end{tabular}
 	\end{tabular}
 \end{center}
 \noindent
 (9) is by theorem \ref{altsClosedForm}, (10) by lemma \ref{rsimpMono} and (11) by inductive hypothesis.
-Combining with the case when $s = []$, one gets
+Combining with the case when $s = []$, we obtain
 \begin{center}
 	\begin{tabular}{lcll}
 		$\rsize{\sum rs \backslash_r s}$ & $\leq$ & $max \; \rsize{\sum rs} \; 1+N*(length \; rs)$
 						 & (12)\\
 	\end{tabular}
 For the closed form to be bounded, we would like
 simplification to be applied to each term in the list.
 Therefore we introduce some variants of $\opterm$,
 which help conveniently express the rewriting steps
 needed in the closed form proof.
+We have $\optermOsimp$, $\optermosimp$ and $\optermsimp$
+with slightly different spellings because they help the proof to go through:
 \begin{center}
 	\begin{tabular}{lcl}
 	$\optermOsimp \; r \; SN$ & $\dn$ & $\textit{case} \; SN\; of$\\
 				 & & $\;\;\Some \; (s, n) \Rightarrow
 				 \textit{rsimp} \; ((r\backslash_{rs} s)\cdot r^{\{n\}})$\\
 	\textit{max} \; (c_N+1)* (N + \llbracket r^{\{n\}} \rrbracket+1)$,
 	where $c_N = \textit{card} \; (\textit{sizeNregex} \; (
 	N + \llbracket r^{\{n\}} \rrbracket_r+1))$.
 \end{theorem}
 \begin{proof}
-We have that for all regular expressions $r'$ in $\textit{set} \; (\map \; (\optermsimp \; r) \; (
+We have that for all regular expressions $r'$ in
+\begin{center}
+$\textit{set} \; (\map \; (\optermsimp \; r) \; (
 	\nupdates \; s \; r \; [\Some \; ([c], n)]))$,
+\end{center}
 	$r'$'s size is less than or equal to $N + \llbracket r^{\{n\}}
 	\rrbracket_r + 1$
 because $r'$ can only be either a $\ZERO$ or $r \backslash_{rsimps} s' \cdot
 r^{\{m\}}$ for some string $s'$ and number
 $m \leq n$ (lemma \ref{ntimesClosedFormListElemShape}).
 \section{Comments and Future Improvements}
 \subsection{Some Experimental Results}
 What guarantee does this bound give us?
-Whatever the regex is, it will not grow indefinitely.
+It states that whatever the regex is, it will not grow indefinitely.
 Take our previous example $(a + aa)^*$ as an example:
 \begin{center}
 	\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
 		\begin{tikzpicture}
 			\begin{axis}[
 We will discuss improvements to this bound in the next chapter.
 \subsection{Possible Further Improvements}
-There are two problems with this finiteness result, though.\\
+There are two problems with this finiteness result, though:\\
 (i)
 		First, it is not yet a direct formalisation of our lexer's complexity,
 		as a complexity proof would require looking into
 		the time it takes to execute {\bf all} the operations
 		involved in the lexer (simp, collect, decode), not just the derivative.\\
 	\item
 		The bound is already a strong indication that catastrophic
 		backtracking is much less likely to occur in our $\blexersimp$
 		algorithm.
 		We refine $\blexersimp$ with $\blexerStrong$ in the next chapter
-		so that the bound becomes polynomial.
+		so that we conjecture the bound becomes polynomial.
 \end{itemize}
 %----------------------------------------------------------------------------------------
 %	SECTION 4
 %----------------------------------------------------------------------------------------
 \begin{center}
 $(\sum_{i = 1}^{n}  (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot  (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)$\\
 [1mm]
 	$(1 \leq m' \leq m )$
 \end{center}
-There at at least exponentially
+There are at least exponentially
 many such terms.\footnote{To be exact, these terms are
 distinct for $m' \leq L.C.M.(1, \ldots, n)$, the details are omitted,
 but the point is that the number is exponential.
 }
 With each new input character taking the derivative against the intermediate result, more and more such distinct

changeset 639	80cc6dc4c98b
parent 638	dd9dde2d902b
child 640	bd1354127574