more proofreading done, last version before submission
authorChengsong
Fri, 30 Dec 2022 23:41:44 +0000
changeset 640 bd1354127574
parent 639 80cc6dc4c98b
child 641 cf7a5c863831
child 646 56057198e4f5
more proofreading done, last version before submission
ChengsongTanPhdThesis/Chapters/Bitcoded1.tex
ChengsongTanPhdThesis/Chapters/Bitcoded2.tex
ChengsongTanPhdThesis/Chapters/Finite.tex
ChengsongTanPhdThesis/Chapters/Future.tex
ChengsongTanPhdThesis/Chapters/RelatedWork.tex
thys2/blexer2.sc
--- a/ChengsongTanPhdThesis/Chapters/Bitcoded1.tex	Fri Dec 30 17:37:51 2022 +0000
+++ b/ChengsongTanPhdThesis/Chapters/Bitcoded1.tex	Fri Dec 30 23:41:44 2022 +0000
@@ -14,7 +14,7 @@
 regular expressions. 
 We have implemented their algorithm in Scala and Isabelle, 
 and found problems 
-in their algorithm such as de-duplication not working properly and redundant
+in their algorithm, such as de-duplication not working properly and redundant
 fixpoint construction. 
 \section{The Motivation Behind Using Bitcodes}
 Let us give again the definition of $\lexer$ from Chapter \ref{Inj}:
@@ -113,7 +113,7 @@
 \caption{Second derivative taken}
 \end{figure}
 \noindent
-As the number of derivative steps increase,
+As the number of derivative steps increases,
 the stack would increase:
 
 \begin{figure}[H]
@@ -327,12 +327,12 @@
 as the lexing proceeds.
 It becomes unnecessary
 to remember all the 
-intermediate expresssions, but only the most recent one
+intermediate expressions, but only the most recent one
 with this bit-carrying regular expression.
 Annotated regular expressions 
 are defined as the following 
 Isabelle datatype:\footnote{ We use subscript notation to indicate
-	that the bitcodes are auxiliary information that do not
+	that the bitcodes are auxiliary information that does not
 interfere with the structure of the regular expressions }
 \begin{center}
 \begin{tabular}{lcl}
@@ -452,9 +452,8 @@
 \end{tikzpicture}
 \caption{A bird's eye view of $\blexer$. The $_{bsi}a_{i}$s stand
 	for the annotated regular expressions in each derivative step.
-	The characters used in each derivative is written as $c_i$.
-The size of the derivatives typically increase during each derivative step,
-therefore we draw the larger circles to indicate that.
+	The characters used in each derivative are written as $c_i$.
+The size of the derivatives typically increases during each derivative step.
 The process starts with an internalise phase and concludes with a decoding phase.}
 \end{figure}
 \noindent
@@ -540,7 +539,7 @@
 The opposite of $\textit{internalise}$ is
 $\erase$, where all bit-codes are removed,
 and the alternative operator $\sum$ for annotated
-regular expressions is transformed to the binary version 
+regular expressions is transformed into the binary version 
 in (plain) regular expressions. This can be defined as follows:
 \begin{center}\label{eraseDef}
 	\begin{tabular}{lcl}
@@ -648,13 +647,13 @@
 The reverse operation of $\decode$ is $\code$.
 This function encodes a value into a bitcode by converting
 $\Left$ into $Z$, $\Right$ into $S$, and marks the start of any non-empty
-star iteration by $S$. The border where a star iteration
-terminates is marked by $Z$. 
+star iteration by $S$. $Z$ marks the border where a star iteration
+terminates. 
 This coding is lossy, as it throws away the information about
 characters, and does not encode the ``boundary'' between two
 sequence values. Moreover, with only the bitcode we cannot even tell
 whether the $S$s and $Z$s are for $\Left/\Right$ or $\Stars$,
-but this will not be necessary in our correctness proof.
+but this will not be necessary for our correctness proof.
 \begin{center}
 \begin{tabular}{lcl}
   $\textit{code}(\Empty)$ & $\dn$ & $[]$\\
@@ -678,8 +677,7 @@
 \begin{proof}
 By proving a more general version of the lemma, on $\decode'$:
 \[\vdash v : r \implies \decode' \; ((\code \; v) @ ds) \; r = (v, ds) \]
-Then setting $ds$ to be $[]$ and unfolding $\decode$ definition,
-we obtain the property.
+Then we obtain the property by setting $ds$ to be $[]$ and unfolding the $\decode$ definition.
 \end{proof}
 \noindent
 Now we define the central part of Sulzmann and Lu's second lexing algorithm,
@@ -856,7 +854,7 @@
 then test whether the result is (b)nullable.
 If yes, then extract the bitcodes from the nullable expression,
 and decode the bitcodes into a lexical value.
-If there does not exists a match between $r$ and $s$ the lexer
+If there does not exist a match between $r$ and $s$, the lexer
 outputs $\None$ indicating a mismatch.
 Ausaf and Urban formally proved the correctness of the $\blexer$, namely
 \begin{property}
@@ -1068,7 +1066,7 @@
 $\flex$.
 \subsection{Specifications of Some Helper Functions}
 The functions we introduce will give a more detailed view into 
-the lexing process, which is not be possible
+the lexing process, which is not possible
 using $\lexer$ or $\blexer$ alone.
 \subsubsection{$\textit{Retrieve}$}
 The first function we shall introduce is $\retrieve$.
@@ -1232,7 +1230,7 @@
 	$\vdash v : (r\backslash c) \implies \retrieve \; (r \backslash c)  \;  v= \retrieve \; r \; (\inj \; r\; c\; v)$
 \end{lemma}
 \begin{proof}
-	By induction on $r$, generalising over $v$.
+	We do an induction on $r$, generalising over $v$.
 	The induction principle in our formalisation
 	is function $\erase$'s cases.
 \end{proof}
@@ -1262,7 +1260,7 @@
 \end{tabular}
 \end{center}
 which accumulates the characters that need to be injected back, 
-and does the injection in a stack-like manner (the last taken derivative is first injected).
+and does the injection in a stack-like manner (the last character chopped off in the derivative phase is the first one injected back).
 The function
 $\flex$ can calculate what $\lexer$ calculates, given the input regular expression
 $r$, the identity function $id$, the input string $s$ and the value
--- a/ChengsongTanPhdThesis/Chapters/Bitcoded2.tex	Fri Dec 30 17:37:51 2022 +0000
+++ b/ChengsongTanPhdThesis/Chapters/Bitcoded2.tex	Fri Dec 30 23:41:44 2022 +0000
@@ -16,7 +16,7 @@
 each intermediate derivative result. This allows
 us to make $\blexer$ much more efficient.
 Sulzmann and Lu already introduced some simplifications for bitcoded regular expressions,
-but their simplification functions  were inefficient and in some cases needed fixing.
+but their simplification functions could have been more efficient and in some cases needed fixing.
 %We contrast our simplification function 
 %with Sulzmann and Lu's, indicating the simplicity of our algorithm.
 %This is another case for the usefulness 
@@ -42,7 +42,7 @@
 	\end{tabular}
 \end{center}
 \noindent
-As can be seen, there are serveral duplications.
+As can be seen, there are several duplications.
 A simple-minded simplification function cannot simplify
 the third regular expression in the above chain of derivative
 regular expressions, namely
@@ -50,7 +50,7 @@
 $((a^*a^* + a^*) + a^*)\cdot(a^*a^*)^* + (a^*a^* + a^*)\cdot(a^*a^*)^*$
 \end{center}
 because the duplicates are
-not next to each other and therefore the rule
+not next to each other, and therefore the rule
 $r+ r \rightarrow r$ from $\textit{simp}$ does not fire.
 One would expect a better simplification function to work in the 
 following way:
@@ -182,7 +182,7 @@
 $ using Sulzmann and Lu's lexer}\label{SulzmannLuLexer}
 \end{figure}
 \noindent
-At $n= 20$ we already get an out of memory error with Scala's normal 
+At $n= 20$ we already get an out-of-memory error with Scala's normal 
 JVM heap size settings.
 In fact their simplification does not improve much over
 the simple-minded simplifications we have shown in \ref{fig:BetterWaterloo}.
@@ -274,7 +274,7 @@
 an entire list to open up possibilities for further simplifications
 with later regular expressions.
 Not flattening the rest of the elements also means that
-the later de-duplication processs 
+the later de-duplication process 
 does not fully remove further duplicates.
 For example,
 using $\textit{simp}\_{SL}$ we cannot
@@ -418,13 +418,13 @@
 \end{center}
 \noindent
 The most involved part is the $\sum$ clause, where we first call $\flts$ on
-the simplified children regular expression list $\textit{map}\; \textit{bsimp}\; \textit{as}$.
-and then call $\distinctBy$ on that list, the predicate determining whether two 
+the simplified children regular expression list $\textit{map}\; \textit{bsimp}\; \textit{as}$,
+and then call $\distinctBy$ on that list. The predicate used in $\distinctBy$ for determining whether two 
 elements are the same is $\rerases \; r_1 = \rerases\; r_2$.
 Finally, depending on whether the regular expression list $as'$ has turned into a
 singleton or empty list after $\flts$ and $\distinctBy$, $\textit{bsimp}_{ALTS}$
 decides whether to keep the current level constructor $\sum$ as it is, and 
-removes it when there are less than two elements:
+removes it when there are fewer than two elements:
 \begin{center}
 	\begin{tabular}{lcl}
 		$\textit{bsimp}_{ALTS} \; bs \; as'$ & $ \dn$ & $ as' \; \textit{match}$\\		
@@ -621,7 +621,7 @@
 The rules $LT$ and $LH$ are for rewriting two regular expression lists
 such that one regular expression
 in the left-hand-side list is rewritable in one step
-to the right-hand-side's regular expression at the same position.
+to the right-hand side's regular expression at the same position.
 This helps with defining the ``context rule'' $AL$.
 
 The reflexive transitive closure of $\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$
@@ -677,7 +677,7 @@
 			$r \rightsquigarrow^* r' \implies _{bs} \sum (r :: rs)\; \rightsquigarrow^*\;  _{bs} \sum (r' :: rs)$
 
 		\item
-			The rewriting in many steps property is composible 
+			The rewriting in many steps property is composable 
 			in terms of the sequence constructor:\\
 			$r_1 \rightsquigarrow^* r_2 
 			\implies _{bs} r_1 \cdot r_3 \rightsquigarrow^* \;  
@@ -704,7 +704,7 @@
 	by the properties $r_1 \rightsquigarrow r_2 \implies \fuse \; bs \; r_1 \rightsquigarrow \fuse \; bs \; r_2$ and
 	$rs_2 \stackrel{s}{\rightsquigarrow} rs_3 
 	\implies \map \; (\fuse \; bs) rs_2 \stackrel{s*}{\rightsquigarrow} \map \; (\fuse \; bs)\; rs_3$,
-	which can be indutively proven by the inductive cases of $\rightsquigarrow$ and 
+	which can be inductively proven by the inductive cases of $\rightsquigarrow$ and 
 	$\stackrel{s}{\rightsquigarrow}$.
 \end{proof}
 \noindent
@@ -781,7 +781,7 @@
 
 For convenience, 
 we define a predicate for a list of regular expressions
-having at least one nullable regular expressions:
+having at least one nullable regular expression:
 \begin{center}
 	$\textit{bnullables} \; rs \quad \dn \quad \exists r \in rs. \;\; \bnullable \; r$
 \end{center}
@@ -803,7 +803,7 @@
 \end{lemma}
 \begin{proof}
 	By rule induction of $\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$.
-	The third point is a corollary of the second.
+	The third point is a result of the second.
 \end{proof}
 \noindent
 For convenience again,
@@ -862,7 +862,7 @@
 	(rs_1 @ (\distinctBy \; rs_2 \; \; \rerases \;\; (\map\;\; \rerases \; \; rs_1)))$
 \end{lemma}
 \noindent
-It says that that for a list made of two parts $rs_1 @ rs_2$, 
+It says that for a list made of two parts $rs_1 @ rs_2$, 
 one can throw away the duplicate
 elements in $rs_2$, as well as those that have appeared in $rs_1$.
 \begin{proof}
@@ -946,7 +946,7 @@
 	and $\stackrel{s}{\rightsquigarrow}$, using a number of the previous lemmas.
 \end{proof}
 \noindent
-Now we can prove property 3, as an immediate corollary:
+Now we can prove property 3 as an immediate corollary:
 \begin{corollary}\label{rewritesBder}
 	$r_1 \rightsquigarrow^* r_2 \implies r_1 \backslash c \rightsquigarrow^*   
 	r_2 \backslash c$
@@ -983,7 +983,7 @@
 		$\bnullable \; (a \backslash s) 
 		\implies \bmkeps \; (a \backslash s) = \bmkeps \; (\bderssimp{a}{s})$
 	\end{center}
-	Now that they generate the same bits, we know that they give the same value after decoding.
+	Now that they generate the same bits, we know they also give the same value after decoding.
 	\begin{center}
 		$\bnullable \; (a \backslash s) 
 		\implies \decode \; r \; (\bmkeps \; (a \backslash s)) = 
@@ -1011,7 +1011,7 @@
 \end{corollary}
 
 \subsection{Comments on the Proof Techniques Used}
-Straightforward and simple as the proof may seem,
+Straightforward as the proof may seem,
 the efforts we spent obtaining it were far from trivial.
 We initially attempted to re-use the argument 
 in \cref{flex_retrieve}. 
@@ -1060,13 +1060,13 @@
 	$_{Z}(_{Z} \ONE + _{S} c)$
 
 \end{center}
-as equal, because they were both re-written
+as equal because they were both re-written
 from the same expression.
 
 The simplification rewriting rules
 given in \ref{rrewriteRules} are by no means
 final;
-one could come up new rules
+one could come up with new rules
 such as 
 $\SEQ r_1 \cdot (\SEQ r_2 \cdot r_3) \rightarrow
 \SEQs [r_1, r_2, r_3]$.
--- a/ChengsongTanPhdThesis/Chapters/Finite.tex	Fri Dec 30 17:37:51 2022 +0000
+++ b/ChengsongTanPhdThesis/Chapters/Finite.tex	Fri Dec 30 23:41:44 2022 +0000
@@ -8,12 +8,12 @@
 %regex's derivatives. 
 
 
-In this chapter we give a bound in terms of size of 
+In this chapter we give a bound in terms of the size of 
 the calculated derivatives: 
 given an annotated regular expression $a$, for any string $s$
 our algorithm $\blexersimp$'s derivatives
 are finitely bounded
-by a constant  that only depends on $a$.
+by a constant that only depends on $a$.
 Formally we show that there exists an $N_a$ such that
 \begin{center}
 	$\llbracket \bderssimp{a}{s} \rrbracket \leq N_a$
@@ -22,9 +22,9 @@
 where the size ($\llbracket \_ \rrbracket$) of 
 an annotated regular expression is defined
 in terms of the number of nodes in its 
-tree structure (its recursive definition given in the next page).
+tree structure (its recursive definition is given on the next page).
 We believe this size bound
-is important in the context of POSIX lexing, because 
+is important in the context of POSIX lexing because 
 \begin{itemize}
 	\item
 		It is a stepping stone towards the goal 
@@ -37,8 +37,8 @@
 		is not just finite but polynomial in $\llbracket a\rrbracket$.
 	\item
 		Having the finite bound formalised 
-		gives us a higher confidence that
-		our simplification algorithm $\simp$ does not ``mis-behave''
+		gives us higher confidence that
+		our simplification algorithm $\simp$ does not ``misbehave''
 		like $\textit{simpSL}$ does.
 		The bound is universal for a given regular expression, 
 		which is an advantage over work which 
@@ -121,7 +121,7 @@
 stay below a size bound $N_a$ determined by the input $a$.
 
 \noindent
-Sulzmann and Lu's assumed a similar picture about their algorithm,
+Sulzmann and Lu assumed a similar picture of their algorithm,
 though in fact their algorithm's size might be better depicted by the following graph:
 \begin{figure}[H]
 	\begin{tikzpicture}[scale=2,
@@ -162,11 +162,11 @@
 They tested out the run time of their
 lexer on particular examples such as $(a+b+ab)^*$
 and claimed that their algorithm is linear w.r.t. the input.
-With our mecahnised proof, we avoid this type of unintentional
+With our mechanised proof, we avoid this type of unintentional
 generalisation.
 
-Before delving in to the details of the formalisation,
-we are going to provide an overview of it in the next subsection.
+Before delving into the details of the formalisation,
+we are going to provide an overview of it in the following subsection.
 
 
 \subsection{Overview of the Proof}
@@ -236,7 +236,7 @@
 allowing a more convenient size bound proof.
 %Of course, the bits which encode the lexing information 
 %would grow linearly with respect 
-%to the input, which should be taken into account when we wish to tackle the runtime comlexity.
+%to the input, which should be taken into account when we wish to tackle the runtime complexity.
 %But for the sake of the structural size 
 %we can safely ignore them.\\
 The datatype 
@@ -269,10 +269,7 @@
 \noindent
 The $r$ in the subscript of $\llbracket \rrbracket_r$ is to 
 differentiate with the same operation for annotated regular expressions.
-Adding $r$ as subscript will be used in 
-other operations as well.\\
-The transformation from an annotated regular expression
-to an r-regular expression is straightforward.
+Similar subscripts will be added for operations like $\rerase{}$:
 \begin{center}
 	\begin{tabular}{lcl}
 		$\rerase{\ZERO}$ & $\dn$ & $\RZERO$\\
@@ -314,7 +311,7 @@
 The  annotated regular expression $\sum[a, b, c]$ would turn into
 $(a+(b+c))$.
 All these operations change the size and structure of
-an annotated regular expressions, adding unnecessary 
+an annotated regular expression, adding unnecessary 
 complexities to the size bound proof.
 
 For example, if we define the size of a basic plain regular expression 
@@ -522,9 +519,9 @@
 %\begin{itemize}
 %	\item
 %		First, we rewrite $r\backslash s$ into something else that is easier
-%		to bound. This step is especially important for the inductive case 
+%		to bound. This step is crucial for the inductive case 
 %		$r_1 \cdot r_2$ and $r^*$, where the derivative can grow and bloat in a wild way,
-%		but after simplification they will always be equal or smaller to a form consisting of an alternative
+%		but after simplification, they will always be equal or smaller to a form consisting of an alternative
 %		list of regular expressions $f \; (g\; (\sum rs))$ with some functions applied to it, where each element will be distinct after the function application.
 %	\item
 %		Then, for such a sum  list of regular expressions $f\; (g\; (\sum rs))$, we can control its size
@@ -580,7 +577,7 @@
 in the simplification, which prevents the $rs$ from growing indefinitely.
 
 Based on this idea, we develop a proof in two steps.
-First, we show the equality (where
+First, we show the following equality (where
 $f$ and $g$ are functions that do not increase the size of the input)
 \begin{center}
 $r\backslash_{rsimps} s = f\; (\textit{rdistinct} \; (g\; \sum rs))$,
@@ -604,7 +601,7 @@
 
 \section{Closed Forms}
 In this section we introduce in detail
-how we express the string derivatives
+how to express the string derivatives
 of regular expressions (i.e. $r \backslash_r s$ where $s$ is a string
 rather than a single character) in a different way than 
 our previous definition.
@@ -623,16 +620,16 @@
 We call such more concrete representations the ``closed forms'' of
 string derivatives as opposed to their original definitions.
 The terminology ``closed form'' is borrowed from mathematics,
-which usually describe expressions that are solely comprised of
+which usually describes expressions that are solely comprised of finitely many
 well-known and easy-to-compute operations such as 
-additions, multiplications, exponential functions.
+additions, multiplications, and exponential functions.
 
 We start by proving some basic identities
 involving the simplification functions for r-regular expressions.
 After that we introduce the rewrite relations
 $\rightsquigarrow_h$, $\rightsquigarrow^*_{scf}$
 $\rightsquigarrow_f$ and $\rightsquigarrow_g$.
-These relations involves similar techniques as in chapter \ref{Bitcoded2}
+These relations involve similar techniques as in chapter \ref{Bitcoded2}
 for annotated regular expressions.
 Finally, we use these identities to establish the
 closed forms of the alternative regular expression,
@@ -645,14 +642,14 @@
 
 In what follows we will often convert between lists
 and sets.
-We use Isabelle's $set$ to refere to the 
+We use Isabelle's $set$ to refer to the 
 function that converts a list $rs$ to the set
 containing all the elements in $rs$.
 \subsubsection{$\textit{rdistinct}$ Does the Job of De-duplication}
 The $\textit{rdistinct}$ function, as its name suggests, will
 de-duplicate an r-regular expression list.
 It will also remove any elements that 
-is already in the accumulator set.
+are already in the accumulator set.
 \begin{lemma}\label{rdistinctDoesTheJob}
 	%The function $\textit{rdistinct}$ satisfies the following
 	%properties:
@@ -675,7 +672,7 @@
 \noindent
 \begin{proof}
 	The first part is by an induction on $rs$.
-	The second and third part can be proven by using the 
+	The second and third parts can be proven by using the 
 	inductive cases of $\textit{rdistinct}$.
 
 \end{proof}
@@ -804,16 +801,16 @@
 We give in this subsection some properties
 involving $\backslash_r$, $\backslash_{rsimps}$, $\textit{rflts}$ and 
 $\textit{rsimp}_{ALTS} $, together with any non-trivial lemmas that lead to them.
-These will be helpful in later closed form proofs, when
+These will be helpful in later closed-form proofs, when
 we want to transform derivative terms which have
 %the ways in which multiple functions involving
 %those are composed together
-interleaving derivatives and  simplifications applied to them.
+interleaving derivatives and simplifications applied to them.
 
 \noindent
 %When the function $\textit{Rflts}$ 
-%is applied to the concatenation of two lists, the output can be calculated by first applying the
-%functions on two lists separately, and then concatenating them together.
+%is applied to the concatenation of two lists; the output can be calculated by first applying the
+%functions on two lists separately and then concatenating them together.
 $\textit{Rflts}$ is composable in terms of concatenation:
 \begin{lemma}\label{rfltsProps}
 	The function $\rflts$ has the properties below:\\
@@ -919,10 +916,10 @@
 which evaluates to true when the regular expression is not an
 alternative, and false otherwise.
 The $\good$ property is preserved under $\rsimp_{ALTS}$, provided that
-its non-empty argument list of expressions are all good themsleves, and $\textit{nonAlt}$, 
+the expressions in its non-empty argument list are all good themselves, $\textit{nonAlt}$, 
 and unique:
 \begin{lemma}\label{rsimpaltsGood}
-	If $rs \neq []$ and forall $r \in rs. \textit{nonAlt} \; r$ and $\textit{isDistinct} \; rs$,
+	If $rs \neq []$ and for all $r \in rs. \textit{nonAlt} \; r$ and $\textit{isDistinct} \; rs$,
 	then $\good \; (\rsimpalts \; rs)$ if and only if for all $r \in rs. \; \good \; r$.
 \end{lemma}
 \noindent
@@ -962,7 +959,7 @@
 	By induction on $r$.
 \end{proof}
 \noindent
-With this we can prove that a regular expressions
+With this we can prove that a regular expression
 after simplification and flattening and de-duplication,
 will not contain any alternative regular expression directly:
 \begin{lemma}\label{nonaltFltsRd}
@@ -995,8 +992,8 @@
 	\end{itemize}
 \end{lemma}
 \begin{proof}
-	The first part is by induction on $rs$, where the induction
-	rule is the inductive cases for $\rflts$.
+	The first part is by an induction following the inductive
+	cases of $\rflts$.
 	The second part is a corollary from the first part.
 \end{proof}
 
@@ -1012,7 +1009,7 @@
 	$\llbracket \rsimp{r}\rrbracket_r$ is smaller than or equal to
 	$\llbracket r \rrbracket_r$.
 	Therefore, in the $r_1 \cdot r_2$ and $\sum rs$ case,
-	Inductive hypothesis applies to the children regular expressions
+	the inductive hypothesis applies to the children regular expressions
 	$r_1$, $r_2$, etc. The lemma \ref{flts3Obv}'s precondition is satisfied
 	by that as well.
 	The lemmas \ref{nonnestedRsimp} and  \ref{nonaltFltsRd} are used
@@ -1101,7 +1098,7 @@
 	\end{itemize}
 \end{corollary}
 \noindent
-This will be useful in the later closed form proof's rewriting steps.
+This will be useful in the later closed-form proof's rewriting steps.
 Similarly, we state the following useful facts below:
 \begin{lemma}
 	The following equalities hold if $r = \rsimp{r'}$ for some $r'$:
@@ -1442,7 +1439,7 @@
 	If $r \hrewrites r' $ then $\rder{c}{r} \hrewrites \rder{c}{r'}$
 \end{lemma}
 \noindent
-This allows us proving more $\mathbf{rsimp}$-equivalent terms, involving $\backslash_r$.
+This allows us to prove more $\mathbf{rsimp}$-equivalent terms, involving $\backslash_r$.
 \begin{lemma}\label{insideSimpRemoval}
 	$\rsimp{(\rder{c}{(\rsimp{r})})} = \rsimp{(\rder{c}{r})}  $
 \end{lemma}
@@ -1530,7 +1527,7 @@
 		$   \longrightarrow_{\backslash c''} \quad \ldots$\\
 	\end{tabular}
 \end{center}
-Roughly speaking $r_1 \cdot r_2 \backslash s$ can be expresssed as 
+Roughly speaking $r_1 \cdot r_2 \backslash s$ can be expressed as 
 a giant alternative taking a list of terms 
 $[r_1 \backslash_r s \cdot r_2, r_2 \backslash_r s'', r_2 \backslash_r s_1'', \ldots]$,
 where the head of the list is always the term
@@ -1600,13 +1597,13 @@
 and only generate the 
 $r_2 \backslash_r s''$ terms satisfying the property
 \begin{center}
-$\exists  s'.  such \; that \; s'@s'' = s \;\; \land \;\;
+$\exists s'.  such \; that \; s'@s'' = s \;\; \land \;\;
 r_1 \backslash s' \; is \; nullable$.
 \end{center}
 Given the arguments $s$ and $r_1$, we denote the list of strings
 $s''$ satisfying the above property as $\vsuf{s}{r_1}$.
 The function $\vsuf{\_}{\_}$ is defined recursively on the structure of the string\footnote{
-	Perhaps a better name of it would be ``NullablePrefixSuffix''
+	Perhaps a better name for it would be ``NullablePrefixSuffix''
 	to differentiate with the list of \emph{all} prefixes of $s$, but
 	that is a bit too long for a function name and we are yet to find
 a more concise and easy-to-understand name.}
@@ -1835,7 +1832,7 @@
 the number of terms in $r^* \backslash s$ will grow exponentially rather than linearly
 in the sequence case.
 The good news is that the function $\textit{rsimp}$ will again
-ignore the difference between differently nesting patterns of alternatives,
+ignore the differences between the various nesting patterns of alternatives,
 and the exponentially growing star derivative like
 \begin{center}
 	$(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + 
@@ -2020,7 +2017,7 @@
 	holds.
 \end{lemma}
 \begin{proof}
-	By an induction on the inductivev cases of $\textit{createdByStar}$.
+	By an induction on the inductive cases of $\textit{createdByStar}$.
 \end{proof}
 %This is not entirely true for annotated regular expressions: 
 %%TODO: bsimp bders \neq bderssimp
@@ -2090,7 +2087,7 @@
 \begin{proof}
 	By using the rewriting relation $\rightsquigarrow$
 \end{proof}
-And from this we obtain that a 
+And from this we obtain the following fact: a 
 regular expression created by star 
 is the same as its flattened version, up to equivalence under $\textit{bsimp}$.
 For example,
@@ -2238,17 +2235,17 @@
 
 In this section, we introduce how we formalised the bound
 on closed forms.
-We first show that in general regular expressions up to a certain 
-size are finite.
+We first show that in general the number of regular expressions up to a certain 
+size is finite.
 Then we prove that functions such as $\rflts$
 will not cause the size of r-regular expressions to grow.
 Putting this together with a general bound 
 on the finiteness of distinct regular expressions
-up to a certain size, we obtain a bound on 
+up to a specific size, we obtain a bound on 
 the closed forms.
 
 \subsection{Finiteness of Distinct Regular Expressions}
-We define the set of regular expressions whose size are no more than
+We define the set of regular expressions whose size is no more than
 a certain size $N$ as $\textit{sizeNregex} \; N$:
 \[
 	\textit{sizeNregex} \; N \dn \{r\; \mid \;  \llbracket r \rrbracket_r \leq N \}
@@ -2261,7 +2258,7 @@
 	By splitting the set $\textit{sizeNregex} \; (N + 1)$ into
 	subsets by their categories:
 	$\{\ZERO_r, \ONE_r, c\}$, $\{r^* \mid r \in \textit{sizeNregex} \; N\}$,
-	and so on. Each of these subsets are finitely bounded.
+	and so on. Each of these subsets is finitely bounded.
 \end{proof}
 \noindent
 From this we get a corollary that
@@ -2329,7 +2326,7 @@
 	\end{itemize}
 \end{lemma}
 \begin{proof}
-	Point 1, 3, 4 can be proven by an induction on $rs$.
+	Points 1, 3, and 4 can be proven by an induction on $rs$.
 	Point 2 is by case analysis on $r_1$ and $r_2$.
 	The last part is a corollary of the previous ones.
 \end{proof}
@@ -2374,8 +2371,8 @@
 will be solely comprised of $r_1 \backslash s'$ 
 and $r_2 \backslash s''$, $s'$ and $s''$ being 
 sub-strings of $s$).
-which will each have a size uppder bound 
-according to inductive hypothesis, which controls $r \backslash s$.
+which will each have a size upper bound 
+according to the inductive hypothesis, which controls $r \backslash s$.
 
 We elaborate the above reasoning by a series of lemmas
 below, where straightforward proofs are omitted.
@@ -2391,7 +2388,7 @@
 We need this so that we know the outcome of our real 
 simplification is better than or equal to a rough estimate,
 and therefore can be bounded by that estimate.
-This is a bit harder to establish compared with proving
+This is a bit harder to establish compared to proving
 $\textit{flts}$ does not make a list larger (which can
 be proven using routine induction):
 \begin{center}
@@ -2486,7 +2483,7 @@
 	By using corollary \ref{interactionFltsDB}.
 \end{proof}
 \noindent
-This is a key lemma in establishing the bounds on all the 
+This is a key lemma in establishing the bounds of all the 
 closed forms.
 With this we are now ready to control the sizes of
 $(r_1 \cdot r_2 )\backslash s$ and $r^* \backslash s$.
@@ -2641,7 +2638,7 @@
 \end{center}
 \noindent
 Arguably we should use $\log \; n$ for the size because
-the number of digits increase logarithmically w.r.t $n$.
+the number of digits increases logarithmically w.r.t $n$.
 For simplicity we choose to add the counter directly to the size.
 
 The derivative w.r.t a bounded regular expression
@@ -2693,7 +2690,7 @@
 The rewrite rules such as $\rightsquigarrow$, $\stackrel{s}{\rightsquigarrow}$ and so on
 do not need to be changed,
 and only a few lemmas such as lemma \ref{fltsPreserves} need to be adjusted to 
-add one more line which can be solved by sledgehammer 
+add one more line which can be solved by the Sledgehammer tool
 to solve the $r^{\{n\}}$ inductive case.
 
 
@@ -2918,7 +2915,7 @@
 	then $\cbn \; (r\backslash_r c)$ holds.
 \end{proof}
 \noindent
-In addition, for $\cbn$-shaped regular expressioins, one can flatten
+In addition, for $\cbn$-shaped regular expressions, one can flatten
 them:
 \begin{lemma}\label{ntimesHfauPushin}
 	If $\cbn \; r$ holds, then $\hflataux{r \backslash_r c} = 
@@ -2981,7 +2978,7 @@
 \end{center}
 \begin{lemma}\label{nupdatesNonempty}
 	If for all elements $o \in \textit{set} \; Ss$,
-	$\nString \; o$ holds, the we have that
+	$\nString \; o$ holds, then we have that
 	for all elements $o' \in \textit{set} \; (\nupdates \; s \; r \; Ss)$,
 	$\nString \; o'$ holds.
 \end{lemma}
@@ -3109,8 +3106,8 @@
 for constructs like $r^{\{\ldots n\}}$, $r^{\{n \ldots\}}$
 and so on, which is still work in progress.
 They should more or less follow the same recipe described in this section.
-Once we know about how to deal with them recursively using suitable auxiliary
-definitions, we are able to routinely establish the proofs.
+Once we know how to deal with them recursively using suitable auxiliary
+definitions, we can routinely establish the proofs.
 
 
 %----------------------------------------------------------------------------------------
@@ -3160,7 +3157,7 @@
 This means the bound we have will surge up at least
 tower-exponentially with a linear increase of the depth.
 
-One might be quite skeptical about what this non-elementary
+One might be skeptical about what this non-elementary
 bound can bring us.
 It turns out that the giant bounds are far from being hit.
 Here we have some test data from randomly generated regular expressions:
@@ -3232,7 +3229,7 @@
 	\end{tabular}    
   \caption{Graphs: size change of 3 randomly generated 
   regular expressions $w.r.t.$ input string length. 
-  The x axis represents the length of input.}
+  The x-axis represents the length of the input.}
 \end{figure}  
 \noindent
 Most of the regex's sizes seem to stay within a polynomial bound $w.r.t$ the 
@@ -3259,8 +3256,8 @@
 		The size proof can serve as a starting point for a complexity
 		formalisation.
 		Derivatives are the most important phases of our lexer algorithm.
-		Size properties about derivatives covers the majority of the algorithm
-		and is therefore a good indication of complexity of the entire program.
+		Size properties about derivatives cover the majority of the algorithm
+		and are therefore a good indication of the complexity of the entire program.
 	\item
 		The bound is already a strong indication that catastrophic
 		backtracking is much less likely to occur in our $\blexersimp$
@@ -3280,8 +3277,8 @@
 
 
 
-One might wonder the actual bound rather than the loose bound we gave
-for the convenience of an easier proof.
+One might wonder about the actual bound rather than the loose bound we gave
+for the convenience of a more straightforward proof.
 How much can the regex $r^* \backslash s$ grow? 
 As  earlier graphs have shown,
 %TODO: reference that graph where size grows quickly
@@ -3322,7 +3319,7 @@
 The exponential size is triggered by the fact that the regex
 $\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$
 inside the $(\ldots) ^*$ having exponentially many
-different derivatives, despite those difference being minor.
+different derivatives, despite those differences being minor.
 $(\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*\backslash \underbrace{a \ldots a}_{\text{m a's}}$
 will therefore contain the following terms (after flattening out all nested 
 alternatives):
@@ -3408,7 +3405,7 @@
 %	SUBSECTION 1
 %-----------------------------------
 %\subsection{Syntactic Equivalence Under $\simp$}
-%We prove that minor differences can be annhilated
+%We prove that minor differences can be annihilated
 %by $\simp$.
 %For example,
 %\begin{center}
--- a/ChengsongTanPhdThesis/Chapters/Future.tex	Fri Dec 30 17:37:51 2022 +0000
+++ b/ChengsongTanPhdThesis/Chapters/Future.tex	Fri Dec 30 23:41:44 2022 +0000
@@ -1,37 +1,37 @@
 % Chapter Template
 
-\chapter{Future Work and Conclusion} % Main chapter title
+\chapter{Conclusion and Future Work} % Main chapter title
 
 \label{Future} 
 
 In this thesis, in order to solve the ReDoS attacks
 once and for all, we have set out to formalise the correctness
 proof of Sulzmann and Lu's
-lexing algorithm with simplifications \cite{Sulzmann2014}
-in Isabelle/HOL. 
-We have fixed inefficiencies in  the original simplification
+lexing algorithm with aggressive simplifications \cite{Sulzmann2014}.
+We formalised our proofs in the Isabelle/HOL theorem prover.
+We have fixed some inefficiencies and a bug in  their original simplification
 algorithm, and established the correctness by proving
 that our algorithm outputs the same result as the original bit-coded
 lexer without simplifications (whose correctness was established in 
 previous work by Ausaf et al. in
 \cite{Ausaf}  and \cite{AusafDyckhoffUrban2016}).
 The proof technique used in \cite{Ausaf} does not work in our case
-because the simplification function messes with the structure of
+because the simplification function messes with the structure of simplified
 regular expressions.
 Despite having to try out several workarounds and being
-stuck for months or even years looking for proofs, 
+stuck for months looking for proofs, 
 we were delighted to have discovered the simple yet
 effective proof method
 by modelling individual
 simplification steps as small-step rewriting rules and proving
-equivalence between terms linked by these rules.
+equivalence between terms linked by these rewrite rules.
 
 In addition, we have proved a formal size bound on the 
 regular expressions. The technique was by establishing
 a ``closed form'' informally
 described by
 Murugesan and Shanmuga Sundaram \cite{Murugesan2014} 
-for compound regular expressions' derivatives
+for the derivatives of compound regular expressions
 and using those closed forms to control the size.
 The Isabelle/HOL code for our formalisation can be 
 found at
@@ -46,29 +46,31 @@
 cubic to the regular expression size using a technique by
 Antimirov\cite{Antimirov95}.
 Once formalised, this would be a guarantee for the absence of all super-linear behaviour.
-We are working out the
+We have yet to work out the
 details.
 
+Our formalisation is approximately 7500 lines of code. Slightly more than half of this concerns the finiteness bound obtained in Chapter 5. This slight ``bloat'' is because we had to repeat many definitions for the rrexp datatype. However, we think we would not have been able to carry out the quite intricate reasoning behind the finiteness result directly on annotated regular expressions, because we would have had to carry around the bit-sequences (which are of course necessary in the algorithm) even though they do not contribute to the size bound of the calculated derivatives.
+
 
 To our best knowledge, no lexing libraries using Brzozowski derivatives
 have similar complexity-related bounds, 
-and claims about running time are usually speculative and backed by empirical
-evidence on a few test cases.
+and claims about running time are usually speculative and backed up only by empirical
+evidence on some test cases.
 If a matching or lexing algorithm
-does not come with certain basic complexity related 
+does not come with complexity-related 
 guarantees (for example the internal data structure size
 does not grow indefinitely), 
-then one cannot claim with confidence having solved the problem
+then one cannot claim with confidence to have solved the problem
 of catastrophic backtracking.
 
 We believe our proof method is not specific to this
 algorithm, and intend to extend this approach
-to prove the correctness of a faster version
-of the lexer. The closed
+to prove the correctness of the faster version
+of the lexer proposed in chapter \ref{Cubic}. The closed
 forms can then be re-used as well. Together
 with the idea from Antimirov \cite{Antimirov95}
-we plan to reduce the size bound to become polynomial
-with respect to the size of the regular expression.
+we plan to reduce the size bound to be just polynomial
+with respect to the size of the regular expressions.
 %We have tried out a variety of other optimisations 
 %to make lexing/parsing with derivatives faster,
 %for example in \cite{Edelmann2021}, \cite{Darragh2020}
@@ -77,12 +79,13 @@
 We have learned quite a few lessons in this process.
 As simple as the end result may seem,
 coming up with the initial proof idea with rewriting relations
-was the hardest for us, as it requires a deep understanding of what information
+was the hardest for us, as it required a deep understanding of what information
 is preserved during the simplification process.
-Having the first formalisation gives us much more confidence----it took
-us much less time to obtain the closed forms and size bound, although they
-are more involved than the correctness proof.
-As having been tested numerous times, 
+There the ideas given by Sulzmann and Lu and Ausaf et al. were of no use.
+%Having the first formalisation gives us much more confidence----it took
+%us much less time to obtain the closed forms and size bound, although they
+%are more involved than the correctness proof.
+Of course, as has been shown many times already,
 the usefulness of formal approaches cannot be overstated:
 they not only allow us to find bugs in Sulzmann and Lu's simplification functions,
 but also helped us set up realistic expectations about performance
@@ -90,35 +93,36 @@
 the $\blexersimp$ lexer defined in chapter \ref{Bitcoded2}
 was already able to achieve the linear bound claimed by
 Sulzmann and Lu.
-We then attempted to prove that $\llbracket \blexersimp\; r \; s\rrbracket$ 
+We then attempted to prove that the size $\llbracket \blexersimp\; r \; s\rrbracket$ 
 is $O(\llbracket r\rrbracket^c\cdot |s|)$,
-where $c$ is a small constant number, making
+where $c$ is a small constant, making
 $\llbracket r\rrbracket^c$ a small constant factor.
-We were quite surprised that this did not go through despite a lot of efforts.
+We were then quite surprised that this did not go through despite a lot of effort.
 This led us to discover the example where $\llbracket r\rrbracket^c$
 can in fact be exponentially large in chapter \ref{Finite}.
 We therefore learned the necessity to introduce the stronger simplifications
 in chapter \ref{Cubic}.
-Without formal proofs we could not have found out this so soon.
+Without formal proofs we would not have found this out so soon,
+if at all.
 
 
-In the future, we plan to incorporate many extensions
-to this work.
+%In the future, we plan to incorporate many extensions
+%to this work.
 
 \section{Future Work}
 The most obvious next-step is to implement the cubic bound and
 correctness of $\blexerStrong$
-in chapter \ref{Cubic} formally.
+in chapter \ref{Cubic}.
 A cubic bound ($O(\llbracket r\rrbracket^c\cdot |s|)$) with respect to 
-regular expression size will fulfill
+regular expression size will get us one step closer to fulfilling
 the linear complexity claim made by Sulzmann and Lu.
 
 With a linear size bound theoretically, the next challenge would be
-to generate code that are competitive with respect to commercial
-matchers. For that a lot of optimisations are needed.
+to generate code that is competitive with
+matchers based on DFAs or NFAs. For that, a lot of optimisations are needed.
 We aim to  integrate the zipper data structure into our lexer.
 The idea is very simple: using a zipper with multiple focuses
-just like Darragh \cite{Darragh2020} did, we could represent
+just like Darragh \cite{Darragh2020} did in their parsing algorithm, we could represent
 \[
 	x\cdot r + y \cdot r + \ldots
 \]
@@ -147,29 +151,29 @@
 In this way, the internal data structures can be pruned more aggressively, 
 leading to better simplifications and ultimately faster algorithms.
 
-Traditional automaton approaches can be sped up by compiling
+Traditional automata approaches can be sped up by compiling
 multiple rules into the same automaton. This has been done
 in \cite{Kumar2006} and \cite{Becchi08}, for example.
 Currently our lexer only deals with a single regular expression each time.
-Putting multiple regular expressions might open up more
+Extending this to multiple regular expressions might open up more
 possibilities of simplifications.
 
 As already mentioned in chapter \ref{RelatedWork},
-reducing the number of memory accesses can accelerate the 
-matching speed. It would be interesting to see the memory
-bandwidth of our derivative-based matching and improve accordingly.
+reducing the number of memory accesses can also accelerate the 
+matching speed. It would be interesting to study the memory
+bandwidth of our derivative-based matching algorithm and improve accordingly.
 
 Memoization has been used frequently in lexing and parsing to achieve
 better complexity results, for example in \cite{Reps1998},
 \cite{Verbatimpp}, \cite{Adams2016}, \cite{Darragh2020} and \cite{Edelmann2021}.
 We plan to explore the performance enhancements by memoization in our algorithm
 in a correctness-preserving way.
-The Monadic data refinement technique that Lammich and Tuerk used
-in \cite{Lammich2012} to optimise the Hopcroft automaton minimisation
-algorithm seem quite interesting.
+The monadic data refinement technique that Lammich and Tuerk used
+in \cite{Lammich2012} to optimise Hopcroft's automaton minimisation
+algorithm seems also quite relevant for such an enterprise.
 We aim to learn from their refinement framework
 which generates high performance code with proofs
-that can be broken down into uncluttered small steps.
+that can be broken down into small steps.
 
 
 
--- a/ChengsongTanPhdThesis/Chapters/RelatedWork.tex	Fri Dec 30 17:37:51 2022 +0000
+++ b/ChengsongTanPhdThesis/Chapters/RelatedWork.tex	Fri Dec 30 23:41:44 2022 +0000
@@ -19,7 +19,7 @@
 It is worth pointing out that this result does not directly
 translate to our
 finiteness proof, 
-and our proof does not depend on it to be valid.
+and our proof does not depend on it.
 The key observation is that our version of the Sulzmann and Lu's algorithm
 \cite{Sulzmann2014} represents 
 derivative terms in a way that allows efficient de-duplication,
@@ -35,12 +35,12 @@
 into a tool called \emph{dreml}.
 The pencil-and-paper proofs in \cite{Sulzmann2014} based on the ideas
 by Frisch and Cardelli \cite{Frisch2004} were later
-found to have unfillable gaps by Ausaf \cite{Ausaf},
+found to have unfillable gaps by Ausaf et al. \cite{AusafDyckhoffUrban2016},
 who came up with an alternative proof inspired by 
 Vansummeren \cite{Vansummeren2006}.
 Sulzmann and Thiemann extended the Brzozowski derivatives to 
 shuffling regular expressions \cite{Sulzmann2015},
-which are a very succinct way of describing basic regular expressions.
+which are a very succinct way of describing regular expressions.
 
 
 
@@ -56,18 +56,20 @@
 of the work by Krauss and Nipkow \parencite{Krauss2011}.  Another one
 in Coq is given by Coquand and Siles \parencite{Coquand2012}.
 Also Ribeiro and Du Bois gave one in Agda \parencite{RibeiroAgda2017}.
-The most recent works to our best knowledge are the Verbatim \cite{Verbatim}
-and Verbatim++ \cite{Verbatim} lexers.
+To the best of our knowledge, the most recent works on verified lexers are the Verbatim \cite{Verbatim}
+and Verbatim++ \cite{Verbatimpp} lexers by Egolf et al.
+Verbatim is based on derivatives but does not simplify them. 
+Therefore it can be very slow on evil regular expressions.
 The Verbatim++ lexer adds many correctness-preserving 
 optimisations to the Verbatim lexer,
 and is therefore quite fast on many inputs.
-The problem is that they chose to use DFA to speed up things,
-for which dealing with bounded repetitions is a bottleneck.
+The problem is that Egolf et al. chose to use traditional DFAs to speed up lexing,
+but then dealing with bounded repetitions is a real bottleneck.
 
 
 This thesis builds on the formal specifications of POSIX
 rules and formal proofs by Ausaf et al. \cite{AusafDyckhoffUrban2016}.
-The bounded repetitions is a continuation of the work by Ausaf \cite{Ausaf}.
+The treatment of bounded repetitions presented here is a continuation of the work by Ausaf \cite{Ausaf}.
 
 Automata formalisations are in general harder and more cumbersome to deal
 with for theorem provers \cite{Nipkow1998}.
@@ -79,7 +81,8 @@
 both graphs are distinct. 
 If they are not distinct, then renaming of the nodes is needed.
 Using Coq which provides dependent types can potentially make things slightly easier
-\cite{Doczkal2013}
+\cite{Doczkal2013}.
+
 Another representation for automata are matrices. 
 But the induction for them is not as straightforward either.
 Both approaches have been used in the past and resulted in huge formalisations.
@@ -88,6 +91,9 @@
 There the problem with combining graphs can be solved better. 
 %but we believe that such clever tricks are not very obvious for 
 %the John-average-Isabelle-user.
+Because of these problems with automata, we prefer regular expressions
+and derivatives, which can be implemented in theorem provers and functional programming languages
+with ease.
 
 \subsection{Different Definitions of POSIX Rules}
 There are different ways to formalise values and POSIX matching.
@@ -107,37 +113,41 @@
 \cite{AusafDyckhoffUrban2016}) as \emph{canonical values}.
 The very interesting link between the work by Ausaf et al.
 and Okui and Suzuki is that they have distinct formalisations
-of POSIX values, and yet they are provably equivalent! See
+of POSIX values, and yet they define the same notion.  See
 \cite{AusafDyckhoffUrban2016} for details of the
-alternative definitions given by Okui and Suzuki and the formalisation.
+alternative definitions given by Okui and Suzuki and the formalisation
+described in this thesis.
 %TODO: EXPLICITLY STATE the OKUI SUZUKI POSIX DEF.
 
-Sulzmann and Lu themselves have come up with POSIX definitions \cite{Sulzmann2014}.
+Sulzmann and Lu themselves have come up with a POSIX definition \cite{Sulzmann2014}.
 In their paper they defined an ordering between values with respect
 to regular expressions, and tried to establish that their
 algorithm outputs the minimal element by a pencil-and-paper proof.
 But having the ordering relation taking regular expression as parameters
 causes the transitivity of their ordering to not go through.
 
+\subsection{Static Analysis of Evil Regex Patterns} 
+
+
 When faced with catastrophic backtracking, 
 sometimes programmers 
 have to rewrite their regexes in an ad hoc fashion.
 Knowing which patterns should be avoided
 can be helpful.
-\subsection{Static Analysis of Evil Regex Patterns} 
-or they try to avoid the possibly problematic patterns completely,
-for which many false positives exist\parencite{Davis18}.
+Tools for this
+often over-approximate evil regular expression patterns,
+and there can be many false positives \cite{Davis18}.
 Animated tools to "debug" regular expressions such as
  \parencite{regexploit2021} \parencite{regex101} are also popular.
 We are also aware of static analysis work on regular expressions that
 aims to detect potentially exponential regex patterns. Rathnayake and Thielecke 
 \parencite{Rathnayake2014StaticAF} proposed an algorithm
 that detects regular expressions triggering exponential
-behavious on backtracking matchers.
+behaviour with backtracking matchers.
 Weideman \parencite{Weideman2017Static} came up with 
 non-linear polynomial worst-time estimates
-for regexes, attack string that exploit the worst-time 
-scenario, and "attack automata" that generates
+for regexes and ``attack strings'' that exploit the worst-time 
+scenario, and introduced an ``attack automaton'' that generates
 attack strings.
 
 
@@ -146,14 +156,14 @@
 \section{Optimisations}
 Perhaps the biggest problem that prevents derivative-based lexing 
 from being more widely adopted
-is that they tend to be not blindingly fast in practice, unable to 
+is that it tends not to be very fast in practice, unable to 
 reach throughputs like gigabytes per second, which is the
-application we had in mind when we initially started looking at the topic.
+application we had in mind when we started looking at the topic.
 Commercial
 regular expression matchers such as Snort \cite{Snort1999} and Bro \cite{Bro}
 are capable of inspecting payloads
 at line rates (which can be up to a few gigabits per second) against 
-thousands of rules \cite{communityRules}.
+thousands of regex rules \cite{communityRules}.
 For our algorithm to be more attractive for practical use, we
 need more correctness-preserving optimisations.
 
@@ -161,16 +171,17 @@
 have the advantages of being
 reconfigurable and parallel, but suffer from lower clock frequency
 and scalability.
-Traditional automaton approaches that use a DFA instead of NFA
+Traditional automata approaches that use DFAs instead of NFAs
 benefit from the fact that only a single transition is needed 
 for each input character \cite{Becchi08}. Lower memory bandwidth leads
 to faster performance.
-However, they suffer from exponential blow-ups on bounded repetitions.
+However, they suffer from exponential memory requirements on bounded repetitions.
 Compression techniques are used, such as those in \cite{Kumar2006} and
 \cite{Becchi2007}.
 Variations of pure NFAs or DFAs like counting-set automata \cite{Turonova2020}
 have been
 proposed to better deal with bounded repetitions.
+But they are usually not used in formalised proofs.
 
 %So far our focus has been mainly on the bit-coded algorithm,
 %but the injection-based lexing algorithm 
@@ -184,46 +195,48 @@
 $((r_1 + r_2)\backslash c)\backslash cs$ (as in definition \ref{table:der}), we rather
 calculate $(r_1\backslash (c::cs) + r_2\backslash (c::cs))$.
 This has the potential to speed up matching because input is 
-processed in larger granularity.
-One interesting thing is to explore whether this can be done
+processed at a larger granularity.
+An interesting question is whether this can be done
 to $\inj$ as well, so that we can generate a lexical value 
 rather than simply get a matcher.
+It is also not yet clear how this idea can be extended to sequences and stars.
 
 
 \subsection{Derivatives and Zippers}
-Zipper is a data structure designed to focus on 
+Zippers are a data structure designed to focus on 
 and navigate between local parts of a tree.
-The idea is that often operations on a large tree only deal with
+The idea is that often operations on large trees only deal with
 local regions each time.
 Therefore it would be a waste to 
 traverse the entire tree if
 the operation only
 involves a small fraction of it.
-It was first formally described by Huet \cite{Huet1997}.
+Zippers were first formally described by Huet \cite{Huet1997}.
 Typical applications of zippers involve text editor buffers
 and proof system databases.
 In our setting, the idea is to compactify the representation
 of derivatives with zippers, thereby making our algorithm faster.
-We introduce several works on parsing, derivatives
+Below we describe several works on parsing, derivatives
 and zippers.
 
 Edelmann et al. developed a formalised parser for LL(1) grammars using derivatives
 \cite{Zippy2020}.
 They adopted zippers to improve the speed, and argued that the runtime
 complexity of their algorithm was linear with respect to the input string.
+They did not provide a formal proof for this.
 
 The idea of using Brzozowski derivatives on general context-free 
 parsing was first implemented
 by Might et al. \cite{Might2011}. 
-They used memoization and fixpoint construction to eliminate infinite recursion,
-which is a major problem for using derivatives on context-free grammars.
-The initial version was quite slow----exponential on the size of the input.
+They used memoization and fixpoint constructions to eliminate infinite recursion,
+which is a major problem for using derivatives with context-free grammars.
+Their initial version was quite slow: exponential in the size of the input.
 Adams et al. \cite{Adams2016} improved the speed and argued that their version
 was cubic with respect to the input.
 Darragh and Adams \cite{Darragh2020} further improved the performance
 by using zippers in an innovative way--their zippers had multiple focuses
 instead of just one in a traditional zipper to handle ambiguity.
-Their algorithm was not formalised, though.
+Their algorithm was not formalised though.
 
 
 
@@ -235,33 +248,33 @@
 With Back References) for brevity.
 It has been shown by Aho \cite{Aho1990}
 that the k-vertex cover problem can be reduced
-to the problem of rewbrs matching, and is therefore NP-complete.
+to the problem of rewbrs matching, and therefore matching with rewbrs is NP-complete.
 Given the depth of the problem, the progress made at the full generality
-of arbitrary rewbrs matching has been slow, with
+of arbitrary rewbrs matching has been slow, with
 theoretical work on them being 
 fairly recent. 
 
-Campaneu et al. studied regexes with back-references
+Campeanu et al. studied rewbrs 
 in the context of formal languages theory in 
-their 2003 work \cite{campeanu2003formal}.
+\cite{campeanu2003formal}.
 They devised a pumping lemma for Perl-like regexes, 
 proving that the langugages denoted by them
 is  properly  contained in context-sensitive
 languages.
 More interesting questions such as 
-whether the language denoted by Perl-like regexes
+whether the languages denoted by Perl-like regexes
 can express palindromes ($\{w \mid w = w^R\}$)
 are discussed in \cite{campeanu2009patterns} 
 and \cite{CAMPEANU2009Intersect}.
 Freydenberger \cite{Frey2013} investigated the 
 expressive power of back-references. He showed several 
 undecidability and descriptional complexity results 
-of back-references, concluding that they add
-great power to certain programming tasks but are difficult to harness.
+for back-references, concluding that they add
+great power to certain programming tasks but are difficult to comprehend.
 An interesting question would then be
 whether we can add restrictions to them, 
-so that they become expressive enough for practical use such
-as html files, but not too powerful.
+so that they become expressive enough for practical use, such
+as matching HTML files, but not too powerful.
 Freydenberger and Schmid \cite{FREYDENBERGER20191}
 introduced the notion of deterministic
 regular expressions with back-references to achieve
@@ -271,12 +284,12 @@
 Fernau and Schmid \cite{FERNAU2015287} and Schmid \cite{Schmid2012}
 investigated the time complexity of different variants
 of back-references.
-We are not aware of any work that uses derivatives on back-references.
+We are not aware of any work that uses derivatives with back-references.
 
-See \cite{BERGLUND2022} for a survey
-of these works and comparison between different
-flavours  of back-references syntax (e.g. whether references can be circular, 
-can labels be repeated etc.).
+%See \cite{BERGLUND2022} for a survey
+%of these works and comparison between different
+%flavours  of back-references syntax (e.g. whether references can be circular, 
+%can labels be repeated etc.).
 
 
  
--- a/thys2/blexer2.sc	Fri Dec 30 17:37:51 2022 +0000
+++ b/thys2/blexer2.sc	Fri Dec 30 23:41:44 2022 +0000
@@ -65,6 +65,7 @@
 
 import scala.util.Try
 
+
 trait Generator[+T] {
   self => // an alias for "this"
     def generate(): T
@@ -1635,6 +1636,14 @@
 
 
 
+  // Sketch: randomly generate a regular expression with at most `depth` levels of nesting
+  // (assumes the constructors CHAR, SEQ, ALTS and STAR defined earlier in this file).
+  def struct_regex(depth: Int) : Rexp =
+    if (depth <= 0) ONE
+    else scala.util.Random.nextInt(3) match { // the "dice" choosing a constructor
+      case 0 => SEQ(CHAR('a'), struct_regex(depth - 1))
+      case 1 => ALTS(List(struct_regex(depth - 1), struct_regex(depth - 1)))
+      case 2 => STAR(struct_regex(depth - 1))
+    }
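+  // Usage sketch (hypothetical helper, not part of the original file): print a few
+  // randomly generated regular expressions of depth 3 using struct_regex above.
+  def struct_regex_demo() : Unit =
+    for (_ <- 0 until 5) println(struct_regex(3))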
 
     def bders_bderssimp() {