until chap 7
author Chengsong
Fri, 30 Dec 2022 17:37:51 +0000
changeset 639 80cc6dc4c98b
parent 638 dd9dde2d902b
child 640 bd1354127574
until chap 7
ChengsongTanPhdThesis/Chapters/Bitcoded1.tex
ChengsongTanPhdThesis/Chapters/Bitcoded2.tex
ChengsongTanPhdThesis/Chapters/Cubic.tex
ChengsongTanPhdThesis/Chapters/Finite.tex
ChengsongTanPhdThesis/Chapters/Inj.tex
thys2/blexer2.sc
--- a/ChengsongTanPhdThesis/Chapters/Bitcoded1.tex	Fri Dec 30 01:52:32 2022 +0000
+++ b/ChengsongTanPhdThesis/Chapters/Bitcoded1.tex	Fri Dec 30 17:37:51 2022 +0000
@@ -704,6 +704,7 @@
 \end{tabular}    
 \end{center}    
 \noindent
+For $\bder \; c\; a$, we use the infix notation $a\backslash c$ for readability.
 The $\bder$ function tells us how regular expressions can be recursively traversed,
 where the bitcodes are augmented and carried around 
 when a derivative is taken.
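To make this concrete, the following is a minimal Scala sketch of bitcoded (annotated) regular expressions together with $\fuse$ and $\bder$. The constructor names and the bit convention for the star case (Z for "take another iteration", S for "stop") follow one common presentation and are not necessarily identical to the accompanying implementation:

abstract class Bit
case object Z extends Bit
case object S extends Bit

abstract class ARexp
case object AZERO extends ARexp
case class AONE(bs: List[Bit]) extends ARexp
case class ACHAR(bs: List[Bit], c: Char) extends ARexp
case class AALTS(bs: List[Bit], rs: List[ARexp]) extends ARexp
case class ASEQ(bs: List[Bit], r1: ARexp, r2: ARexp) extends ARexp
case class ASTAR(bs: List[Bit], r: ARexp) extends ARexp

// prepend the bitcodes bs to the top-level annotation of r
def fuse(bs: List[Bit], r: ARexp): ARexp = r match {
  case AZERO            => AZERO
  case AONE(cs)         => AONE(bs ++ cs)
  case ACHAR(cs, c)     => ACHAR(bs ++ cs, c)
  case AALTS(cs, rs)    => AALTS(bs ++ cs, rs)
  case ASEQ(cs, r1, r2) => ASEQ(bs ++ cs, r1, r2)
  case ASTAR(cs, r1)    => ASTAR(bs ++ cs, r1)
}

def bnullable(r: ARexp): Boolean = r match {
  case AZERO           => false
  case AONE(_)         => true
  case ACHAR(_, _)     => false
  case AALTS(_, rs)    => rs.exists(bnullable)
  case ASEQ(_, r1, r2) => bnullable(r1) && bnullable(r2)
  case ASTAR(_, _)     => true
}

// read off the bitcodes of an empty-string match
// (only called on bnullable regular expressions)
def bmkeps(r: ARexp): List[Bit] = r match {
  case AONE(bs)         => bs
  case AALTS(bs, rs)    => bs ++ bmkeps(rs.find(bnullable).get)
  case ASEQ(bs, r1, r2) => bs ++ bmkeps(r1) ++ bmkeps(r2)
  case ASTAR(bs, _)     => bs ++ List(S)
}

// the derivative a \ c: bitcodes are carried along and augmented
def bder(c: Char, r: ARexp): ARexp = r match {
  case AZERO         => AZERO
  case AONE(_)       => AZERO
  case ACHAR(bs, d)  => if (c == d) AONE(bs) else AZERO
  case AALTS(bs, rs) => AALTS(bs, rs.map(bder(c, _)))
  case ASEQ(bs, r1, r2) =>
    if (bnullable(r1))
      AALTS(bs, List(ASEQ(Nil, bder(c, r1), r2),
                     fuse(bmkeps(r1), bder(c, r2))))
    else ASEQ(bs, bder(c, r1), r2)
  case ASTAR(bs, r1) =>
    ASEQ(bs, fuse(List(Z), bder(c, r1)), ASTAR(Nil, r1))
}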
--- a/ChengsongTanPhdThesis/Chapters/Bitcoded2.tex	Fri Dec 30 01:52:32 2022 +0000
+++ b/ChengsongTanPhdThesis/Chapters/Bitcoded2.tex	Fri Dec 30 17:37:51 2022 +0000
@@ -16,7 +16,7 @@
 each intermediate derivative result. This allows
 us to make $\blexer$ much more efficient.
 Sulzmann and Lu already introduced some simplifications for bitcoded regular expressions,
-but their simplification functions  were inefficient and in some cases needs fixing.
+but their simplification functions  were inefficient and in some cases needed fixing.
 %We contrast our simplification function 
 %with Sulzmann and Lu's, indicating the simplicity of our algorithm.
 %This is another case for the usefulness 
@@ -51,7 +51,7 @@
 \end{center}
 because the duplicates are
 not next to each other and therefore the rule
-$r+ r \rightarrow r$ does not fire.
+$r+ r \rightarrow r$ from $\textit{simp}$ does not fire.
 One would expect a better simplification function to work in the 
 following way:
 \begin{gather*}
@@ -94,7 +94,7 @@
 		$_{bs}\sum ((\map \; (\fuse \; bs')\; rs_1) @ rs_2)$\\
 		$\textit{simp}\_{SL}  \; _{bs}\sum[r]$ & $\dn$ & $\fuse \; bs \; (\textit{simp}\_{SL}  \; r)$\\
 		$\textit{simp}\_{SL}  \; _{bs}\sum(r::rs)$ & $\dn$ & $_{bs}\sum 
-		(\nub \; (\filter \; (\not \circ \zeroable)\;((\textit{simp}\_{SL}  \; r) :: \map \; \textit{simp}\_{SL}  \; rs)))$\\ 
+		(\nub \; (\filter \; (\neg\zeroable)\;((\textit{simp}\_{SL}  \; r) :: \map \; \textit{simp}\_{SL}  \; rs)))$\\ 
 		
 	\end{tabular}
 \end{center}
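The last clause above is the one responsible for de-duplication. A rough Scala rendering of it (reusing the annotated regular expressions sketched earlier; simpSL and zeroable are our own renderings, not Sulzmann and Lu's actual code) makes the role of nub explicit:

// does r denote the empty language?
def zeroable(r: ARexp): Boolean = r match {
  case AZERO           => true
  case AALTS(_, rs)    => rs.forall(zeroable)
  case ASEQ(_, r1, r2) => zeroable(r1) || zeroable(r2)
  case _               => false
}

// sketch of the alternative clause of simp_SL: the nub-style
// de-duplication (.distinct) removes only syntactically identical
// elements, so two alternatives that differ only in their bitcode
// annotations both survive
def simpSL(r: ARexp): ARexp = r match {
  case AALTS(bs, rs) =>
    AALTS(bs, rs.map(simpSL).filterNot(zeroable).distinct)
  case r1 => r1   // the remaining clauses are omitted in this sketch
}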
@@ -163,7 +163,7 @@
 \noindent
 We implemented this lexing algorithm in Scala, 
 and found that the final derivative regular expression
-size grows exponentially fast (note the logarithmic scale):
+size still grows exponentially (note the logarithmic scale):
 \begin{figure}[H]
 	\centering
 \begin{tikzpicture}
@@ -206,17 +206,18 @@
 \end{figure}
 \noindent
 which seems like a counterexample for 
-their linear complexity claim:
+Sulzmann and Lu's linear complexity claim
+in their paper \cite{Sulzmann2014}:
 \begin{quote}\it
-Linear-Time Complexity Claim \\It is easy to see that each call of one of the functions/operations:
-simp, fuse, mkEpsBC and isPhi leads to subcalls whose number is bound by the size of the regular expression involved. We claim that thanks to aggressively applying simp this size remains finite. Hence, we can argue that the above mentioned functions/operations have constant time complexity which implies that we can incrementally compute bit-coded parse trees in linear time in the size of the input. 
+``Linear-Time Complexity Claim \\It is easy to see that each call of one of the functions/operations:
+simp, fuse, mkEpsBC and isPhi leads to subcalls whose number is bound by the size of the regular expression involved. We claim that thanks to aggressively applying simp this size remains finite. Hence, we can argue that the above mentioned functions/operations have constant time complexity which implies that we can incrementally compute bit-coded parse trees in linear time in the size of the input.'' 
 \end{quote}
 \noindent
 The assumption that the size of the regular expressions
 in the algorithm
-would stay below a finite constant is not true, not in the
+would stay below a finite constant is not true, at least not in the
 examples we considered.
-The main reason behind this is that (i) the $\textit{nub}$
+The main reason behind this is that (i) Haskell's $\textit{nub}$
 function requires identical annotations between two 
 annotated regular expressions to qualify as duplicates,
 and therefore cannot simplify cases like $_{SZZ}a^*+_{SZS}a^*$
@@ -232,24 +233,24 @@
 \end{center}
 \noindent
 clause, and therefore is not strong enough to simplify all
-needed parts of the regular expression. Moreover, even if the regular expressions size
-do stay finite, one has to take into account that
-the $\textit{simp}\_{SL}$ function is applied many times
-in each derivative step, and that number is not necessarily
-a constant with respect to the size of the regular expression.
+needed parts of the regular expression. Moreover,
+the $\textit{simp}\_{SL}$ function is applied repeatedly
+in each derivative step until a fixed point is reached, 
+which makes the algorithm even more
+unpredictable and inefficient.
 %To not get ``caught off guard'' by
 %these counterexamples,
 %one needs to be more careful when designing the
 %simplification function and making claims about them.
 
 \section{Our $\textit{Simp}$ Function}
-We will now introduce our simplification function.
+We will now introduce our own simplification function.
 %by making a contrast with $\textit{simp}\_{SL}$.
 We also describe
 the ideas behind Sulzmann and Lu's $\textit{simp}\_{SL}$
 algorithm 
-and why it fails to achieve the desired effect. 
-Our simplification function comes with a formal
+and why it fails to achieve the desired effect of keeping the sizes of derivatives finitely bounded. 
+In addition, our simplification function will come with a formal
 correctness proof.
 \subsection{Flattening Nested Alternatives}
 The idea behind the clause
@@ -282,7 +283,7 @@
 	$((a^* a^*)+\underline{(a^* + a^*)})\cdot (a^*a^*)^*+
 ((a^*a^*)+a^*)\cdot (a^*a^*)^*$
 \end{center}
-due to the underlined part not being in the head 
+due to the underlined part not being the head 
 of the alternative.
 
 We define our flatten operation so that it flattens 
@@ -400,7 +401,7 @@
 \noindent
 The simplification (named $\textit{bsimp}$ for \emph{b}it-coded) 
 does a pattern matching on the regular expression.
-When it detected that the regular expression is an alternative or
+When it detects that the regular expression is an alternative or
 sequence, it will try to simplify its children regular expressions
 recursively and then see if one of the children turns into $\ZERO$ or
 $\ONE$, which might trigger further simplification at the current level.
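A hedged Scala sketch of this behaviour (reusing the annotated regular expressions and fuse from the earlier sketch; erase, which strips all bitcodes off an annotated regular expression, is assumed) is given below: nested alternatives are flattened by a helper flts and duplicates are removed up to erased regular expressions.

// flatten nested alternatives one level, fusing their bitcodes into
// the children, and drop AZEROs along the way
def flts(rs: List[ARexp]): List[ARexp] = rs match {
  case Nil                    => Nil
  case AZERO :: rest          => flts(rest)
  case AALTS(bs, rs1) :: rest => rs1.map(fuse(bs, _)) ++ flts(rest)
  case r :: rest              => r :: flts(rest)
}

// keep the first occurrence of each element, where two elements count
// as equal when f maps them to the same value (here: erased regexes)
def distinctBy[A, B](xs: List[A], f: A => B, acc: Set[B] = Set()): List[A] =
  xs match {
    case Nil => Nil
    case x :: rest =>
      if (acc.contains(f(x))) distinctBy(rest, f, acc)
      else x :: distinctBy(rest, f, acc + f(x))
  }

def bsimp(r: ARexp): ARexp = r match {
  case ASEQ(bs, r1, r2) => (bsimp(r1), bsimp(r2)) match {
    case (AZERO, _)      => AZERO                   // 0 . r  ~>  0
    case (_, AZERO)      => AZERO                   // r . 0  ~>  0
    case (AONE(bs1), s2) => fuse(bs ++ bs1, s2)     // 1 . r  ~>  r
    case (s1, s2)        => ASEQ(bs, s1, s2)
  }
  case AALTS(bs, rs) =>
    distinctBy(flts(rs.map(bsimp)), erase) match {  // erase is assumed
      case Nil      => AZERO                        // empty alternative ~> 0
      case List(r1) => fuse(bs, r1)                 // singleton ~> its element
      case rs1      => AALTS(bs, rs1)
    }
  case r1 => r1
}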
@@ -459,15 +460,15 @@
 \begin{center}
 \begin{tabular}{lcl}
   $\textit{blexer\_simp}\;r\,s$ & $\dn$ &
-      $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\                
+      $\textit{let}\;a = (r^\uparrow)\backslash_{bsimp}\, s\;\textit{in}$\\                
   & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
   & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
   & & $\;\;\textit{else}\;\textit{None}$
 \end{tabular}
 \end{center}
-
 \noindent
-This algorithm keeps the regular expression size small.
+This algorithm keeps the regular expression size small, 
+as we shall demonstrate with some examples in the next section.
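As a rough Scala sketch (using bsimp and bder from the earlier sketches, and assuming the plain regular expression type Rexp, a value type Val, and the usual internalise and decode helpers of the bitcoded lexer), $\blexersimp$ corresponds to:

// one simplification per derivative step
def bders_simp(s: List[Char], r: ARexp): ARexp = s match {
  case Nil     => r
  case c :: cs => bders_simp(cs, bsimp(bder(c, r)))
}

// the simplified bitcoded lexer: internalise the regular expression,
// take simplified derivatives w.r.t. the whole input, then decode
def blexer_simp(r: Rexp, s: String): Option[Val] = {
  val a = bders_simp(s.toList, internalise(r))
  if (bnullable(a)) Some(decode(bmkeps(a), r)) else None
}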
 
 
 \subsection{Examples $(a+aa)^*$ and $(a^*\cdot a^*)^*$
@@ -510,12 +511,12 @@
 \end{tikzpicture} 
 \end{tabular}
 \end{center}
-\caption{Our Improvement over Sulzmann and Lu's in terms of size}
+\caption{Our improvement over Sulzmann and Lu's simplification in terms of the size of the derivatives.}
 \end{figure}
 \noindent
 Given the size difference, it is not
 surprising that our $\blexersimp$ significantly outperforms
-$\textit{blexer\_SLSimp}$.
+$\textit{blexer\_SLSimp}$ by Sulzmann and Lu.
 In the next section we are going to establish that our
 simplification preserves the correctness of the algorithm.
 %----------------------------------------------------------------------------------------
@@ -621,7 +622,8 @@
 such that one regular expression
 in the left-hand-side list is rewritable in one step
 to the right-hand-side's regular expression at the same position.
-This helps with defining the ``context rule'' $AL$.\\
+This helps with defining the ``context rule'' $AL$.
+
 The reflexive transitive closure of $\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$
 are defined in the usual way:
 \begin{figure}[H]
@@ -775,7 +777,8 @@
 extract the same bitcodes using $\bmkeps$ from the nullable
 components of two regular expressions $r$ and $r'$,
 if we can rewrite from one to the other in finitely
-many steps.\\
+many steps.
+
 For convenience, 
 we define a predicate for a list of regular expressions
 having at least one nullable regular expressions:
@@ -850,7 +853,8 @@
 which says that our simplification's helper functions 
 such as $\distinctBy$ and $\flts$ describe
 reducts of $\stackrel{s*}{\rightsquigarrow}$ and 
-$\rightsquigarrow^* $.\\
+$\rightsquigarrow^* $.
+
 The first lemma to prove is a more general version of 
 $rs_ 1 \rightsquigarrow^* \distinctBy \; rs_1 \; \phi$:
 \begin{lemma}
@@ -972,7 +976,7 @@
 	derivative regular expressions to the 
 	lexer with simplification applied (by lemma \ref{bderBderssimp}):
 	\begin{center}
-		$a \backslash s \stackrel{*}{\rightsquigarrow} \bderssimp{a}{s} $.
+		$a \backslash s \rightsquigarrow^* \bderssimp{a}{s} $.
 	\end{center}
 	We know that they generate the same bits, if the lexing result is a match:
 	\begin{center}
@@ -998,8 +1002,8 @@
 	$\nexists v. \; (r, s) \rightarrow v \;\; \textit{iff} \;\; \blexer\;
 	r\;s = \None$.
 \end{center}
-and obtain the corollary that the bit-coded lexer with simplification is
-indeed correctly outputting POSIX lexing result, if such a result exists.
+and obtain the property that the bit-coded lexer with simplification is
+indeed correctly generating a POSIX lexing result, if such a result exists.
 \begin{corollary}
 	$(r, s) \rightarrow v \;\; \textit{iff} \;\; \blexersimp \; r\; s = \Some \; v$\\
 	$\nexists v. \; (r, s) \rightarrow v \;\; \textit{iff} \;\; \blexersimp\;
@@ -1015,17 +1019,15 @@
 that the annotated regular expressions stay unsimplified, 
 so that one can 
 correctly compare $v_{i+1}$ and $r_i$  and $v_i$ 
-in diagram \ref{graph:inj} and 
-``fit the key into the lock hole''.
+in diagram \ref{graph:inj}.
 
-\noindent
 We also tried to prove 
 \begin{center}
 $\textit{bsimp} \;\; (\bderssimp{a}{s}) = 
 \textit{bsimp} \;\;  (a\backslash s)$,
 \end{center}
 but this turns out to be not true.
-A counterexample would be
+A counterexample is
 \[ a = [(_{Z}1+_{S}c)\cdot [bb \cdot (_{Z}1+_{S}c)]] \;\; 
 	\text{and} \;\; s = bb.
 \]
@@ -1046,7 +1048,8 @@
 we will always have this discrepancy. 
 This is due to 
 the $\map \; (\fuse\; bs) \; as$ operation 
-happening at different locations in the regular expression.\\
+happening at different locations in the regular expression.
+
 The rewriting relation 
 $\rightsquigarrow^*$ 
 allows us to ignore this discrepancy
@@ -1058,7 +1061,8 @@
 
 \end{center}
 as equal, because they were both re-written
-from the same expression.\\
+from the same expression.
+
 The simplification rewriting rules
 given in \ref{rrewriteRules} are by no means
 final,
--- a/ChengsongTanPhdThesis/Chapters/Cubic.tex	Fri Dec 30 01:52:32 2022 +0000
+++ b/ChengsongTanPhdThesis/Chapters/Cubic.tex	Fri Dec 30 17:37:51 2022 +0000
@@ -27,14 +27,13 @@
 size bound from the previous chapter can be 
 improved to a cubic bound.
 We implemented our conjecture in Scala.
-We intend to formalise this part in Isabelle/HOL at a
-later stage.
+We have not formalised this part in Isabelle/HOL. 
 %we have not been able to finish due to time constraints of the PhD.
 Nevertheless, we outline the ideas we intend to use for the proof.
 
 \section{A Stronger Version of Simplification}
 
-We present further improvements
+Let us first present further improvements
 for our lexer algorithm $\blexersimp$.
 We devise a stronger simplification algorithm,
 called $\bsimpStrong$, which can prune away
@@ -70,7 +69,8 @@
 	aa \cdot a^*+ a \cdot a^* + aa\cdot a^*
 \]
 contains three terms, 
-expressing three possibilities for how it can match some input.
+expressing three possibilities for how it can match some more
+input of the form $a \ldots a$.
 The first and the third terms are identical, which means we can eliminate
 the latter as it will not contribute to a POSIX value.
 In $\bsimps$, the $\distinctBy$ function takes care of 
@@ -302,7 +302,11 @@
 
 \end{lstlisting}
 \caption{The helper functions of $\textit{prune}$: 
-  $\textit{atMostEmpty}$, $\textit{isOne}$ and $\textit{removeSeqTail}$}
+$\textit{atMostEmpty}$, $\textit{isOne}$ and $\textit{removeSeqTail}$. The function $\textit{atMostEmpty}$
+takes a regular expression and returns true only if its language
+contains nothing more than the empty string. $\textit{isOne}$ tests whether
+a regular expression is equivalent to $\ONE$. $\textit{removeSeqTail}$
+takes away the tail of a sequence regular expression. (A Scala sketch of these helpers is given after the figure.)}
 \end{figure}
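A possible Scala reading of these three helpers (on plain regular expressions with the constructors ZERO, ONE, CHAR, SEQ, ALTS and STAR, and with the standard nullable; this is a sketch rather than the exact definitions of the implementation) is:

// true if the language of r contains at most the empty string
// (a syntactic approximation: it may answer false for expressions
// whose language happens to be empty, e.g. SEQ(ZERO, CHAR('a')))
def atMostEmpty(r: Rexp): Boolean = r match {
  case ZERO         => true
  case ONE          => true
  case CHAR(_)      => false
  case ALTS(r1, r2) => atMostEmpty(r1) && atMostEmpty(r2)
  case SEQ(r1, r2)  => atMostEmpty(r1) && atMostEmpty(r2)
  case STAR(r1)     => atMostEmpty(r1)
}

// r is equivalent to ONE: it matches the empty string and nothing else
def isOne(r: Rexp): Boolean = nullable(r) && atMostEmpty(r)

// strip the tail r2 off a sequence r1 . r2
def removeSeqTail(r: Rexp): Rexp = r match {
  case SEQ(r1, _) => r1
  case _          => ZERO   // only ever applied to sequences here
}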
 \noindent
 Suppose we feed 
@@ -388,7 +392,9 @@
         }
 
 \end{lstlisting}
-\caption{A Stronger Version of $\textit{distinctBy}$ XXX}
+\caption{A Stronger Version of $\textit{distinctBy}$. This function allows ``partial de-duplication''
+of a regular expression: when part of a regular expression has appeared before in the accumulator, we
+can remove that redundant part. (A Scala sketch is given after the figure.)}
 \end{figure}
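A hedged Scala sketch of this ``partial de-duplication'' (the pruning function is passed in as a parameter; erase is assumed, and the exact bookkeeping of the accumulator in the implementation is more involved than shown here) is:

// like distinctBy, but an element that has not been seen as a whole is
// first pruned: sub-terms that already occur in the accumulator are
// removed before the element is kept
def distinctWith(rs: List[ARexp],
                 prune: (ARexp, Set[Rexp]) => ARexp,
                 acc: Set[Rexp]): List[ARexp] = rs match {
  case Nil => Nil
  case r :: rest =>
    if (acc.contains(erase(r)))
      distinctWith(rest, prune, acc)        // seen as a whole: drop it
    else {
      val rp = prune(r, acc)                // remove the parts seen before
      rp :: distinctWith(rest, prune, acc + erase(rp))
    }
}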
 \noindent
 Once a regular expression has been pruned,
@@ -437,9 +443,9 @@
 $\textit{prune}$ refining the finiteness bound
 to a cubic bound has not been formalised yet.
 Therefore we choose to use Scala code rather than an Isabelle-style formal 
-definition like we did for $\simp$, as the definitions might change
+definition like we did for $\textit{bsimp}$, as the definitions might change
 to suit our proof needs.
-In the rest of the chapter we will use this convention consistently.
+%In the rest of the chapter we will use this convention consistently.
 
 %The function $\textit{prune}$ is used in $\distinctWith$.
 %$\distinctWith$ is a stronger version of $\distinctBy$
@@ -669,7 +675,7 @@
 in our derivative is also only $n * (n + 1) / 2$.
 Therefore
 we conjecture that $\bsimpStrong$ is also able to achieve this
-upper limit in general
+upper limit in general.
 \begin{conjecture}\label{bsimpStrongInclusionPder}
 	Using a suitable transformation $f$, we have that
 	\begin{center}
--- a/ChengsongTanPhdThesis/Chapters/Finite.tex	Fri Dec 30 01:52:32 2022 +0000
+++ b/ChengsongTanPhdThesis/Chapters/Finite.tex	Fri Dec 30 17:37:51 2022 +0000
@@ -39,11 +39,11 @@
 		Having the finite bound formalised 
 		gives us a higher confidence that
 		our simplification algorithm $\simp$ does not ``mis-behave''
-		like $\simpsulz$ does.
+		like $\textit{simp}\_{SL}$ does.
 		The bound is universal for a given regular expression, 
 		which is an advantage over work which 
 		only gives empirical evidence on 
-		some test cases (see Verbatim++ work\cite{Verbatimpp}).
+		some test cases (see for example the Verbatim work \cite{Verbatimpp}).
 \end{itemize}
 \noindent
 We then extend our $\blexersimp$
@@ -51,19 +51,19 @@
 We update our formalisation of 
 the correctness and finiteness properties to
 include this new construct. 
-we can out-compete other verified lexers such as
+We show that we can out-compete other verified lexers such as
 Verbatim++ on bounded regular expressions.
 
 In the next section we describe in more detail
 what the finite bound means in our algorithm
 and why the size of the internal data structures of
 a typical derivative-based lexer such as
-Sulzmann and Lu's need formal treatment.
+Sulzmann and Lu's needs formal treatment.
 
 
 
 
-\section{Formalising About Size}
+\section{Formalising Size Bound of Derivatives}
 \noindent
 In our lexer ($\blexersimp$),
 we take an annotated regular expression as input,
@@ -112,11 +112,11 @@
 \noindent
 Each time
 a derivative is taken, the regular expression might grow.
-However, the simplification that is immediately afterwards will always shrink it so that 
-it stays relatively small.
+However, the simplification that follows immediately afterwards will often shrink it so that
+the overall size of the derivatives stays relatively small.
 This intuition is depicted by the relative size
 change between the black and blue nodes:
-After $\simp$ the node always shrinks.
+After $\simp$ the node typically shrinks.
 Our proof states that all the blue nodes
 stay below a size bound $N_a$ determined by the input $a$.
 
@@ -154,7 +154,7 @@
 	\caption{Regular expression size change during our $\blexersimp$ algorithm}\label{sulzShrinks}
 \end{figure}
 \noindent
-The picture means that on certain cases their lexer (where they use $\simpsulz$ 
+The picture means that in some cases their lexer (where they use $\simpsulz$ 
 as the simplification function)
 will have a size explosion, causing the running time 
 of each derivative step to grow continuously (for example 
@@ -163,12 +163,10 @@
 lexer on particular examples such as $(a+b+ab)^*$
 and claimed that their algorithm is linear w.r.t.\ the input.
 With our mechanised proof, we avoid this type of unintentional
-generalisation.\\
+generalisation.
 
 Before delving in to the details of the formalisation,
-we are going to provide an overview of it.
-In the next subsection, we give a high-level view
-of the proof.
+we are going to provide an overview of it in the next subsection.
 
 
 \subsection{Overview of the Proof}
@@ -235,7 +233,7 @@
 The first step is to define 
 $\textit{rrexp}$s.
 They are annotated regular expressions without bitcodes,
-allowing a much simpler size bound proof.
+allowing a more convenient size bound proof.
 %Of course, the bits which encode the lexing information 
 %would grow linearly with respect 
 %to the input, which should be taken into account when we wish to tackle the runtime comlexity.
@@ -248,6 +246,7 @@
 The reason for the prefix $r$ is
 to make a distinction  
 with basic regular expressions.
+We repeat here the definition of $\rrexp$:
 \[			\rrexp ::=   \RZERO \mid  \RONE
 	\mid  \RCHAR{c}  
 	\mid  \RSEQ{r_1}{r_2}
@@ -308,7 +307,7 @@
 	\end{tabular}
 \end{center}
 \noindent
-As can be seen alternative regular expression with an empty argument list
+As can be seen, alternative regular expressions with an empty argument list
 will be turned into a $\ZERO$.
 The singleton alternative $\sum [r]$ becomes $r$ during the
 $\erase$ function.
@@ -316,7 +315,8 @@
 $(a+(b+c))$.
 All these operations change the size and structure of
 an annotated regular expressions, adding unnecessary 
-complexities to the size bound proof.\\
+complexities to the size bound proof.
+
 For example, if we define the size of a basic plain regular expression 
 in the usual way,
 \begin{center}
@@ -353,8 +353,10 @@
 but we found our approach more straightforward.\\
 
 \subsection{Functions for R-regular Expressions}
+The downside of our approach is that we need to redefine
+several functions for $\rrexp$.
 In this section we shall define the r-regular expression version
-of $\blexer$, and $\blexersimp$ related functions.
+of $\bder$ and $\textit{bsimp}$-related functions.
 We use $r$ as the prefix or subscript to differentiate
 with the bitcoded version.
 %For example,$\backslash_r$, $\rdistincts$, and $\rsimp$
@@ -384,8 +386,8 @@
 \end{center}  
 \noindent
 where we omit the definition of $\textit{rnullable}$.
-The generalisation from derivative w.r.t to character to
-derivative w.r.t string is given as
+The generalisation from the derivative w.r.t.\ a character to
+derivatives w.r.t.\ strings is given as
 \begin{center}
 	\begin{tabular}{lcl}
 		$r \backslash_{rs} []$ & $\dn$ & $r$\\
@@ -587,7 +589,7 @@
 For example, for $r_1 \cdot r_2$ we have the equality as
 	\begin{center}
 		$ \rderssimp{r_1 \cdot r_2}{s} = 
-		\rsimp{(\sum (r_1 \backslash s \cdot r_2 ) \; :: \;(\map \; \rderssimp{r2}{\_} \;(\vsuf{s}{r_1})))}$
+		\rsimp{(\sum (r_1 \backslash s \cdot r_2 ) \; :: \;(\map \; \rderssimp{r_2}{\_} \;(\vsuf{s}{r_1})))}$
 	\end{center}
 We call the right-hand-side the 
 \emph{Closed Form} of $(r_1 \cdot r_2)\backslash_{rsimps} s$.
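Intuitively, $\vsuf{s}{r_1}$ collects those suffixes of $s$ at which the first component of the sequence has become nullable, so that the derivative ``spills over'' into $r_2$. A Scala sketch of one way to compute such suffixes (written with the plain der and nullable for readability; this is an illustration and not necessarily the exact definition used in the formalisation) is:

// suffixes s2 of s such that, writing s = s1 @ s2, the derivative of
// r1 w.r.t. the prefix s1 is nullable; for each such s2 the closed
// form contains the term r2 \ s2
def vsuf(s: List[Char], r1: Rexp): List[List[Char]] = s match {
  case Nil => Nil
  case c :: cs =>
    if (nullable(r1)) (c :: cs) :: vsuf(cs, der(c, r1))
    else vsuf(cs, der(c, r1))
}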
@@ -667,7 +669,7 @@
 			$\textit{isDistinct} \;\;\; (\rdistinct{rs}{acc})$.
 		\item
 			$\textit{set} \; (\rdistinct{rs}{acc}) 
-			= (textit{set} \; rs) - acc$
+			= (\textit{set} \; rs) - acc$
 	\end{itemize}
 \end{lemma}
 \noindent
@@ -772,8 +774,8 @@
 	By \ref{rdistinctDoesTheJob} and \ref{distinctRemovesMiddle}.
 \end{proof}
 \noindent
-The next lemma is a more general form of \ref{rdistinctConcat},
-it says that
+The next lemma is a more general form of \ref{rdistinctConcat};
+it says that
 $\textit{rdistinct}$ is composable w.r.t list concatenation:
 \begin{lemma}\label{distinctRdistinctAppend}
 			If $\;\; \textit{isDistinct} \; rs_1$, 
@@ -791,8 +793,8 @@
 $\textit{rdistinct}$ needs to be applied only once, and 
 applying it multiple times does not make any difference:
 \begin{corollary}\label{distinctOnceEnough}
-	$\textit{rdistinct} \; (rs @ rsa) {} = \textit{rdistinct} \; (rdistinct \; 
-	rs \{ \} @ (\textit{rdistinct} \; rs_a \; (set \; rs)))$
+	$\textit{rdistinct} \; (rs @ rsa) {} = \textit{rdistinct} \; ( (rdistinct \; 
+	rs \; \{ \}) @ (\textit{rdistinct} \; rs_a \; (set \; rs)))$
 \end{corollary}
 \begin{proof}
 	By lemma \ref{distinctRdistinctAppend}.
@@ -845,7 +847,7 @@
 \noindent
 Now we introduce the property that the operations 
 derivative and $\rsimpalts$
-commute, this will be used later on, when deriving the closed form for
+commute; this will be used later on when deriving the closed form for
 the alternative regular expression:
 \begin{lemma}\label{rderRsimpAltsCommute}
 	$\rder{x}{(\rsimpalts \; rs)} = \rsimpalts \; (\map \; (\rder{x}{\_}) \; rs)$
@@ -855,13 +857,13 @@
 \end{proof}
 \noindent
 
-\subsubsection{The $RL$ Function: Language Interpretation of $\textit{Rrexp}$s}
+\subsubsection{The $RL$ Function: Language Interpretation for $\textit{Rrexp}$s}
 Much like the definition of $L$ on plain regular expressions, one can also 
-define the language interpretation of $\rrexp$s.
+define the language interpretation for $\rrexp$s.
 \begin{center}
 	\begin{tabular}{lcl}
-		$RL \; (\ZERO)$ & $\dn$ & $\phi$\\
-		$RL \; (\ONE)$ & $\dn$ & $\{[]\}$\\
+		$RL \; (\ZERO_r)$ & $\dn$ & $\phi$\\
+		$RL \; (\ONE_r)$ & $\dn$ & $\{[]\}$\\
 		$RL \; (c)$ & $\dn$ & $\{[c]\}$\\
 		$RL \; \sum rs$ & $\dn$ & $ \bigcup_{r \in rs} (RL \; r)$\\
 		$RL \; (r_1 \cdot r_2)$ & $\dn$ & $ RL \; (r_1) @ RL \; (r_2)$\\
@@ -960,7 +962,7 @@
 	By induction on $r$.
 \end{proof}
 \noindent
-With this we could prove that a regular expressions
+With this we can prove that a regular expression,
 after simplification and flattening and de-duplication,
 will not contain any alternative regular expression directly:
 \begin{lemma}\label{nonaltFltsRd}
@@ -998,7 +1000,7 @@
 	The second part is a corollary from the first part.
 \end{proof}
 
-And this leads to good structural property of $\rsimp{}$,
+This leads to a good structural property of $\rsimp{}$,
 that after simplification, a regular expression is
 either good or $\RZERO$:
 \begin{lemma}\label{good1}
@@ -1017,7 +1019,7 @@
 	to ensure that goodness is preserved at the topmost level.
 \end{proof}
 We shall prove that any good regular expression is 
-a fixed-point for $\rsimp{}$.
+a fixed-point for $\textit{rsimp}$.
 First we prove an auxiliary lemma:
 \begin{lemma}\label{goodaltsNonalt}
 	If $\good \; \sum rs$, then $\rflts\; rs = rs$.
@@ -1039,7 +1041,7 @@
 	case where 2 or more elements are present in the list.
 \end{proof}
 \noindent
-Given below is a property involving $\rflts$, $\textit{rdistinct}$, 
+Below we show a property involving $\rflts$, $\textit{rdistinct}$, 
 $\rsimp{}$ and $\rsimp_{ALTS}$,
 which requires $\ref{good1}$ to go through smoothly:
 \begin{lemma}\label{flattenRsimpalts}
@@ -1081,11 +1083,13 @@
 This property means we do not have to repeatedly
 apply simplification in each step, which justifies
 our definition of $\blexersimp$.
+This is in contrast to the work of Sulzmann and Lu where
+the simplification is applied in a fixpoint manner.
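The contrast can be pictured with a small Scala sketch; fixpoint is a generic helper introduced only for this illustration:

// Sulzmann and Lu's strategy: re-apply the simplification until
// nothing changes any more
def fixpoint[A](f: A => A)(x: A): A = {
  val y = f(x)
  if (y == x) x else fixpoint(f)(y)
}
// With an idempotent simplification, the fixed point is already reached
// after a single application, so one call per derivative step suffices.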
 
 
 On the other hand, we can repeat the same $\rsimp{}$ applications
 on regular expressions as many times as we want, if we have at least
-one simplification applied to it, and apply it wherever we would like to:
+one simplification applied to it, and apply it wherever we need to:
 \begin{corollary}\label{headOneMoreSimp}
 	The following properties hold, directly from \ref{rsimpIdem}:
 
@@ -1115,18 +1119,18 @@
 \end{proof}
 
 \noindent
-With the idempotency of $\rsimp{}$ and its corollaries, 
+With the idempotency of $\textit{rsimp}$ and its corollaries, 
 we can start proving some key equalities leading to the 
 closed forms.
-Now presented are a few equivalent terms under $\rsimp{}$.
+Next we present a few equivalent terms under $\textit{rsimp}$.
 To make the notation more concise
-We use $r_1 \sequal r_2 $ to denote that $\rsimp{r_1} = \rsimp{r_2}$:
-\begin{center}
-\begin{tabular}{lcl}
-	$a \sequal b$ & $ \dn$ & $ \textit{rsimp} \; a = \textit{rsimp} \; b$
-\end{tabular}
-\end{center}
-\noindent
+we write $r_1 \sequal r_2$ to denote that $\rsimp{r_1} = \rsimp{r_2}$.
+%\begin{center}
+%\begin{tabular}{lcl}
+%	$a \sequal b$ & $ \dn$ & $ \textit{rsimp} \; a = \textit{rsimp} \; b$
+%\end{tabular}
+%\end{center}
+%\noindent
 %\vspace{0em}
 \begin{lemma}
 	The following equivalences hold:
@@ -1140,7 +1144,7 @@
 	\item
 		$\sum ((\sum rs_a) :: rs_b) \sequal \sum rs_a @ rs_b$
 	\item
-		$\RALTS{rs} = \RALTS{\map \; \rsimp{} \; rs}$
+		$\RALTS{rs} \sequal \RALTS{\map \; \rsimp{} \; rs}$
 \end{itemize}
 \end{lemma}
 \begin{proof}
@@ -1300,7 +1304,7 @@
 We define
 two separate list rewriting relations $\frewrite$ and $\grewrite$.
 The rewriting steps that take place during
-flattening is characterised by $\frewrite$.
+flattening are characterised by $\frewrite$.
 The rewrite relation $\grewrite$ characterises both flattening and de-duplicating.
 Sometimes $\grewrites$ is slightly too powerful
 so we would rather use $\frewrites$ to prove
@@ -1331,7 +1335,7 @@
 These two lemmas can both be proven using a straightforward induction (and
 the proofs for them are therefore omitted).
 
-Now the above equalities can be derived like a breeze:
+Now the above equalities can be derived with ease: 
 \begin{corollary}
 	$\sum (\rDistinct \;\; (\map \; (\_ \backslash x) \; (\rflts \; rs)) \;\; \varnothing) \sequal
 	\sum (\rDistinct \;\;  (\rflts \; (\map \; (\_ \backslash x) \; rs)) \;\; \varnothing)
@@ -1367,7 +1371,7 @@
 \subsubsection{Terms That Can Be Rewritten Using $\hrewrites$, $\grewrites$, and $\frewrites$}
 In this part, we present lemmas stating
 pairs of r-regular expressions and r-regular expression lists
-where one could rewrite from one in many steps to the other.
+where one can rewrite from one in many steps to the other.
 Most of the proofs to these lemmas are straightforward, using
 an induction on the corresponding rewriting relations.
 These proofs will therefore be omitted when this is the case.
@@ -1424,15 +1428,15 @@
 \noindent
 Now comes the core of the proof, 
 which says that once two lists are rewritable to each other,
-then they are equivalent under $\rsimp{}$:
+then they are equivalent under $\textit{rsimp}$:
 \begin{lemma}
 	If $r_1 \hrewrites r_2$ then $r_1 \sequal r_2$.
 \end{lemma}
 
 \noindent
-Similar to what we did in \ref{Bitcoded2}, 
+Similar to what we did in Chapter \ref{Bitcoded2},
 we prove that if one can rewrite from one r-regular expression ($r$)
-to the other ($r'$), after taking derivatives one could still rewrite
+to the other ($r'$), after taking derivatives one can still rewrite
 the first ($r\backslash c$) to the other ($r'\backslash c$).
 \begin{lemma}\label{interleave}
 	If $r \hrewrites r' $ then $\rder{c}{r} \hrewrites \rder{c}{r'}$
@@ -1556,11 +1560,11 @@
 The $\delta$ function 
 returns $r$ when the boolean condition
 $b$ evaluates to true and
-$\ZERO$ otherwise:
+$\ZERO_r$ otherwise:
 \begin{center}
 	\begin{tabular}{lcl}
 		$\delta \; b\; r$ & $\dn$ & $r \quad \textit{if} \; b \; is \;\textit{true}$\\
-				  & $\dn$ & $\ZERO \quad otherwise$
+				  & $\dn$ & $\ZERO_r \quad otherwise$
 	\end{tabular}
 \end{center}
 \noindent
@@ -1587,9 +1591,9 @@
 \]
 instead of
 \[
-	(r_1 \backslash_r c_1c_2 + \ZERO ) + \ZERO.
+	(r_1 \backslash_r c_1c_2 + \ZERO_r ) + \ZERO_r.
 \]
-The redundant $\ZERO$s will not be created in the
+The redundant $\ZERO_r$s will not be created in the
 first place.
 In a closed-form one needs to take into account this (because
 closed forms require exact equality rather than language equivalence)
@@ -1629,7 +1633,7 @@
 With $\textit{Suffix}$ we are ready to express the
 sequence regular expression's closed form,
 but before doing so 
-more devices are needed.
+more definitions are needed.
 The first thing is the flattening function $\sflat{\_}$,
 which takes an alternative regular expression and produces a flattened version
 of that alternative regular expression.
@@ -1664,7 +1668,7 @@
 	\begin{tabular}{lcl}
 		$\sflataux{[]}'$ & $ \dn $ & $ []$\\
 		$\sflataux{ (r_1 + r_2) :: rs }'$ & $\dn$ & $r_1 :: r_2 :: rs$\\
-		$\sflataux{r :: rs}$ & $\dn$ & $ r::rs$
+		$\sflataux{r :: rs}'$ & $\dn$ & $ r::rs$
 	\end{tabular}
 \end{center}
 \noindent
@@ -1703,9 +1707,9 @@
 	and $xs  @ [x]$.
 \end{proof}
 \noindent
-If we have a regular expression $r$ whose shpae 
+If we have a regular expression $r$ whose shape 
 fits into those described by $\textit{createdBySequence}$,
-then we can convert between 
+then we can convert between
 $r \backslash_r c$ and $(\sflataux{r}) \backslash_r c$ with
 $\sflataux{\_}'$:
 \begin{lemma}\label{sfauIdemDer}
@@ -1725,7 +1729,7 @@
 the shape of $r_1 \cdot r_2 \backslash s$
 \begin{lemma}\label{seqSfau0}
 	$\sflataux{(r_1 \cdot r_2) \backslash_r s} = (r_1 \backslash_r s) \cdot r_2 
-	:: (\map \; (r_2 \backslash_r \_) \; (\textit{Suffix} \; s \; r1))$ 
+	:: (\map \; (r_2 \backslash_r \_) \; (\textit{Suffix} \; s \; r_1))$ 
 \end{lemma}
 \begin{proof}
 	By a reverse induction on the string $s$, where the inductive cases 
@@ -1755,7 +1759,7 @@
 
 We now use $\textit{createdBySequence}$ and
 $\sflataux{\_}$ to describe an intuition
-behind the closed form of the sequence closed form.
+behind the sequence closed form.
 If two regular expressions only differ in the way their
 alternatives are nested, then we should be able to get the same result
 once we apply simplification to both of them:
@@ -1786,7 +1790,7 @@
 \begin{proof}
 	We know that 
 	$\sflataux{(r_1 \cdot r_2) \backslash_r s} = (r_1 \backslash_r s) \cdot r_2 
-	:: (\map \; (r_2 \backslash_r \_) \; (\textit{Suffix} \; s \; r1))$
+	:: (\map \; (r_2 \backslash_r \_) \; (\textit{Suffix} \; s \; r_1))$
 	holds
 	by lemma \ref{seqSfau0}.
 	This allows the theorem to go through because of lemma \ref{sflatRsimpeq}.
@@ -1845,8 +1849,8 @@
 which can be de-duplicated by $\rDistinct$
 and therefore bounded finitely.
 
-and then de-duplicate terms of the form  ($s'$ being a substring of $s$).
-This allows us to use a similar technique as $r_1 \cdot r_2$ case,
+%and then de-duplicate terms of the form  ($s'$ being a substring of $s$).
+%This allows us to use a similar technique as $r_1 \cdot r_2$ case,
 
 Now the crux of this section is finding a suitable description
 for $rs$ where
@@ -2068,7 +2072,7 @@
 \end{proof}
 \noindent
 
-$\hflataux{\_}$ has a similar effect as $\textit{flatten}$:
+The function $\hflataux{\_}$ has a similar effect to $\textit{flatten}$:
 \begin{lemma}\label{hflatauxGrewrites}
 	$a :: rs \grewrites \hflataux{a} @ rs$
 \end{lemma}
@@ -2086,11 +2090,12 @@
 \begin{proof}
 	By using the rewriting relation $\rightsquigarrow$
 \end{proof}
-And from this we obtain a proof that a star's derivative will be the same
-as if it had all its nested alternatives created during deriving being flattened out:
+And from this we obtain that a regular expression created by a star
+is the same as its flattened version, up to equivalence under $\textit{rsimp}$.
 For example,
 \begin{lemma}\label{hfauRsimpeq2}
-	$\createdByStar{r} \implies \rsimp{r} = \rsimp{\RALTS{\hflataux{r}}}$
+	$\textit{createdByStar} \; r \implies \rsimp{r} = \rsimp{\RALTS{\hflataux{r}}}$
 \end{lemma}
 \begin{proof}
 	By structural induction on $r$, where the induction rules 
@@ -2255,7 +2260,7 @@
 \begin{proof}
 	By splitting the set $\textit{sizeNregex} \; (N + 1)$ into
 	subsets by their categories:
-	$\{\ZERO, \ONE, c\}$, $\{* `` \textit{sizeNregex} \; N\}$,
+	$\{\ZERO_r, \ONE_r, c\}$, $\{r^* \mid r \in \textit{sizeNregex} \; N\}$,
 	and so on. Each of these subsets are finitely bounded.
 \end{proof}
 \noindent
@@ -2283,7 +2288,7 @@
 %provided that each of $rs'$'s element
 %is bounded by $N$.
 
-\subsection{$\textit{rsimp}$ Does Not Increment the Size}
+\subsection{$\textit{rsimp}$ Does Not Increase the Size}
 Although it seems evident, we need a series
 of non-trivial lemmas to establish that functions such as $\rflts$
 do not cause the regular expressions to grow.
@@ -2340,7 +2345,7 @@
 
 \subsection{Estimating the Closed Forms' sizes}
 We recap the closed forms we obtained
-earlier by putting them together down below:
+earlier:
 \begin{itemize}
 	\item
 		$\rderssimp{(\sum rs)}{s} \sequal
@@ -2374,15 +2379,10 @@
 
 We elaborate the above reasoning by a series of lemmas
 below, where straightforward proofs are omitted.
-
-
-
-
-We want to apply it to our setting $\rsize{\rsimp{\sum rs}}$.
-
-We show that $\rdistinct$ and $\rflts$
+%We want to apply it to our setting $\rsize{\rsimp{\sum rs}}$.
+We show that $\textit{rdistinct}$ and $\rflts$
 working together is at least as 
-good as $\rdistinct{}{}$ alone, which can be written as
+good as $\textit{rdistinct}$ alone, which can be written as
 \begin{center}
 	$\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r 
 	\leq 
@@ -2443,7 +2443,7 @@
 	$\llbracket  (\textit{rdistinct} \; (\textit{rflts} \; as) \;
 	(noalts\_set \cup corr\_set)) \rrbracket_r$ & $\leq$ &\\
 						    $\llbracket  (\textit{rdistinct} \; as \; (noalts\_set \cup alts\_set \cup
-	\{ \ZERO \} )) \rrbracket_r$ & & \\ 
+	\{ \ZERO_r \} )) \rrbracket_r$ & & \\ 
 	\end{tabular}
 	\end{center}
 		holds.
@@ -2555,7 +2555,7 @@
 (5) is by theorem \ref{starClosedForm}.
 (6) is by \ref{altsSimpControl}.
 (7) is by corollary \ref{finiteSizeNCorollary}.
-Combining with the case when $s = []$, one gets
+Combining with the case when $s = []$, one obtains
 \begin{center}
 	\begin{tabular}{lcll}
 		$\rsize{r^* \backslash_r s}$ & $\leq$ & $max \; n_r \; 1 + (\textit{card} (\sizeNregex \; (N + n_r)))
@@ -2579,7 +2579,7 @@
 \noindent
 (9) is by theorem \ref{altsClosedForm}, (10) by lemma \ref{rsimpMono} and (11) by inductive hypothesis.
 
-Combining with the case when $s = []$, one gets
+Combining with the case when $s = []$, we obtain 
 \begin{center}
 	\begin{tabular}{lcll}
 		$\rsize{\sum rs \backslash_r s}$ & $\leq$ & $max \; \rsize{\sum rs} \; 1+N*(length \; rs)$ 
@@ -2828,6 +2828,8 @@
 Therefore we introduce some variants of $\opterm$,
 which help conveniently express the rewriting steps 
 needed in the closed form proof.
+These variants, $\optermOsimp$, $\optermosimp$ and $\optermsimp$, are spelled
+almost identically; each one is needed at a different stage of the proof:
 \begin{center}
 	\begin{tabular}{lcl}
 	$\optermOsimp \; r \; SN$ & $\dn$ & $\textit{case} \; SN\; of$\\
@@ -3084,8 +3086,11 @@
 	N + \llbracket r^{\{n\}} \rrbracket_r+1))$.
 \end{theorem}
 \begin{proof}
-We have that for all regular expressions $r'$ in $\textit{set} \; (\map \; (\optermsimp \; r) \; (
+We have that for all regular expressions $r'$ in 
+\begin{center}
+$\textit{set} \; (\map \; (\optermsimp \; r) \; (
 	\nupdates \; s \; r \; [\Some \; ([c], n)]))$,
+\end{center}
 	$r'$'s size is less than or equal to $N + \llbracket r^{\{n\}} 
 	\rrbracket_r + 1$
 because $r'$ can only be either a $\ZERO$ or $r \backslash_{rsimps} s' \cdot
@@ -3116,7 +3121,7 @@
 \section{Comments and Future Improvements}
 \subsection{Some Experimental Results}
 What guarantee does this bound give us?
-Whatever the regex is, it will not grow indefinitely.
+It states that, whatever the regular expression is, its derivatives will not grow indefinitely.
 Take our previous example $(a + aa)^*$ as an example:
 \begin{center}
 	\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
@@ -3237,7 +3242,7 @@
 
 
 \subsection{Possible Further Improvements}
-There are two problems with this finiteness result, though.\\
+There are two problems with this finiteness result, though:\\
 (i)	
 		First, it is not yet a direct formalisation of our lexer's complexity,
 		as a complexity proof would require looking into 
@@ -3261,7 +3266,7 @@
 		backtracking is much less likely to occur in our $\blexersimp$
 		algorithm.
 		We refine $\blexersimp$ with $\blexerStrong$ in the next chapter
-		so that the bound becomes polynomial.
+		so that, we conjecture, the bound becomes polynomial.
 \end{itemize}
 
 %----------------------------------------------------------------------------------------
@@ -3326,7 +3331,7 @@
 [1mm]
 	$(1 \leq m' \leq m )$
 \end{center}
-There at at least exponentially
+There are at least exponentially
 many such terms.\footnote{To be exact, these terms are 
 distinct for $m' \leq L.C.M.(1, \ldots, n)$, the details are omitted,
 but the point is that the number is exponential.
--- a/ChengsongTanPhdThesis/Chapters/Inj.tex	Fri Dec 30 01:52:32 2022 +0000
+++ b/ChengsongTanPhdThesis/Chapters/Inj.tex	Fri Dec 30 17:37:51 2022 +0000
@@ -1383,7 +1383,16 @@
 and therefore can be thrown away.
 
 Ausaf et al. \cite{AusafDyckhoffUrban2016} 
-have come up with some simplification steps, however those steps
+have come up with some simplification steps
+and incorporated them into $\lexer$.
+They call this lexer $\textit{lexer}_{simp}$ and proved that
+\[
+	\lexer \; r \; s = \textit{lexer}_{simp} \; r \; s.
+\]
+The function $\textit{lexer}_{simp}$
+involves some fiddly manipulation of value rectification,
+which we omit here.
+However, these simplification steps
 are not yet sufficiently strong to achieve the above effects.
 And even with these relatively mild simplifications, the proof
 is already quite a bit more complicated than the theorem \ref{lexerCorrectness}.
--- a/thys2/blexer2.sc	Fri Dec 30 01:52:32 2022 +0000
+++ b/thys2/blexer2.sc	Fri Dec 30 17:37:51 2022 +0000
@@ -1505,7 +1505,7 @@
     }
 
   }
-  naive_matcher()
+  // naive_matcher()
   //single(SEQ(SEQ(STAR(CHAR('b')),STAR(STAR(SEQ(CHAR('a'),CHAR('b'))))),
   //  SEQ(SEQ(CHAR('b'),STAR(ALTS(CHAR('a'),ONE))),ONE)))
 
@@ -1638,13 +1638,14 @@
 
     def bders_bderssimp() {
 
-      test(single(SEQ(ALTS(ONE,CHAR('c')),
-        SEQ(SEQ(CHAR('a'),CHAR('a')),ALTS(ONE,CHAR('c'))))), 1) { (r: Rexp) => 
+      
+      val r = SEQ(SEQ(CHAR('a'),CHAR('a')),ALTS(ONE,CHAR('c')))
           // ALTS(SEQ(SEQ(ONE,CHAR('a')),STAR(CHAR('a'))),SEQ(ALTS(CHAR('c'),ONE),STAR(ZERO))))))), 1) { (r: Rexp) => 
-          val ss = Set("aa")//stringsFromRexp(r)
+      val ss = Set("aa")//stringsFromRexp(r)
           val boolList = ss.map(s => {
             //val bdStrong = bdersStrong(s.toList, internalise(r))
-            val bds = bsimp(bders(s.toList, internalise(r)))
+            val bds = bders(s.toList, internalise(r))
             val bdssimp = bsimp(bders_simp(s.toList, internalise(r)))
             //val pdersSet = pderUNIV(r)//.flatMap(r => turnIntoTerms(r))
             //val pdersSetBroken = pdersSet.flatMap(r => turnIntoTerms(r))
@@ -1652,17 +1653,19 @@
             //bdStrong6Set.size <= pdersSet.size || bdStrong6Set.size <= pdersSetBroken.size ||
             //rs1_subseteq_rs2(bdStrong6Set.toSet, pdersSet union pdersSetBroken)//|| bdStrong6Set.size <= pdersSetBroken.size
             rprint(r)
-            println(bds)
-            println(bdssimp)
+            println(asize(bds))
+            println(asize(bdssimp))
             bds == bdssimp
           })
           //!boolList.exists(b => b == false)
           !boolList.exists(b => b == false)
           }
-        }
-        //  bders_bderssimp()
+        
+
+
+        bders_bderssimp()
 
 
 
-println("hi!!!!!")
-counterexample_check()
\ No newline at end of file
+// println("hi!!!!!")
+// counterexample_check()
\ No newline at end of file