lexing: changeset 622:4b1149fb5aec

--- a/ChengsongTanPhdThesis/Chapters/Bitcoded1.tex	Tue Nov 08 23:24:54 2022 +0000
+++ b/ChengsongTanPhdThesis/Chapters/Bitcoded1.tex	Fri Nov 11 00:23:53 2022 +0000
@@ -1044,7 +1044,7 @@
 some helper functions such as $\retrieve$ and
 $\flex$.
 \subsection{Specifications of Some Helper Functions}
-The functions we introduce will give a more detailed glimpse into 
+The functions we introduce will give a more detailed view into 
 the lexing process, which is not be possible
 using $\lexer$ or $\blexer$ alone.
 \subsubsection{$\textit{Retrieve}$}

--- a/ChengsongTanPhdThesis/Chapters/Inj.tex	Tue Nov 08 23:24:54 2022 +0000
+++ b/ChengsongTanPhdThesis/Chapters/Inj.tex	Fri Nov 11 00:23:53 2022 +0000
@@ -13,32 +13,34 @@
 This is essentially a description in ``English''
 of our formalisation in Isabelle/HOL.
 We also give the definition of what $\POSIX$ lexing means, 
-followed by a lexing algorithm by Sulzmanna and Lu\parencite{Sulzmann2014} 
+followed by a lexing algorithm by Sulzmanna and Lu \parencite{Sulzmann2014} 
 that produces the output conforming
-to the $\POSIX$ standard.
-\footnote{In what follows 
+to the $\POSIX$ standard\footnote{In what follows 
 we choose to use the Isabelle-style notation
 for function applications, where
 the parameters of a function are not enclosed
 inside a pair of parentheses (e.g. $f \;x \;y$
 instead of $f(x,\;y)$). This is mainly
-to make the text visually more concise.}
+to make the text visually more concise.}.
 
 \section{Basic Concepts}
-Usually, formal language theory starts with an alphabet 
+Formal language theory usually starts with an alphabet 
 denoting a set of characters.
 Here we just use the datatype of characters from Isabelle,
 which roughly corresponds to the ASCII characters.
 In what follows, we shall leave the information about the alphabet
 implicit.
 Then using the usual bracket notation for lists,
-we can define strings made up of characters: 
+we can define strings made up of characters as: 
 \begin{center}
 \begin{tabular}{lcl}
 $\textit{s}$ & $\dn$ & $[] \; |\; c  :: s$
 \end{tabular}
 \end{center}
 where $c$ is a variable ranging over characters.
+The $::$ stands for list cons and $[]$ for the empty
+list.
+A singleton list is sometimes written as $[c]$ for brevity.
 Strings can be concatenated to form longer strings in the same
 way as we concatenate two lists, which we shall write as $s_1 @ s_2$.
 We omit the precise 
@@ -98,42 +100,40 @@
 \end{center}
 \noindent
 which is essentially the left quotient $A \backslash L$ of $A$ against 
-the singleton language with $L = \{s\}$
-in formal language theory.
-However, for the purposes here, the $\textit{Ders}$ definition with 
+the singleton language with $L = \{s\}$.
+However, for our purposes here, the $\textit{Ders}$ definition with 
 a single string is sufficient.
 
 The reason for defining derivatives
-is that it provides a different approach
+is that they provide another approach
 to test membership of a string in 
 a set of strings. 
 For example, to test whether the string
 $bar$ is contained in the set $\{foo, bar, brak\}$, one takes derivative of the set with
 respect to the string $bar$:
 \begin{center}
-\begin{tabular}{lclll}
+\begin{tabular}{lll}
 	$S = \{foo, bar, brak\}$ & $ \stackrel{\backslash b}{\rightarrow }$ & 
-	$\{ar, rak\}$ & 
-	$\stackrel{\backslash a}{\rightarrow}$ &
-	\\
-	$\{r \}$ & $\stackrel{\backslash r}{\rightarrow}$ & $\{[]\}$ &
-	$\stackrel{[] \in S \backslash bar}{\longrightarrow}$ & $bar \in S$\\
+	$\{ar, rak\}$ \\
+				 & $\stackrel{\backslash a}{\rightarrow}$ & $\{r \}$\\
+				 & $\stackrel{\backslash r}{\rightarrow}$ & $\{[]\}$\\
+				 %& $\stackrel{[] \in S \backslash bar}{\longrightarrow}$ & $bar \in S$\\
 \end{tabular}	
 \end{center}
 \noindent
 and in the end test whether the set
-has the empty string \footnote{ we use the infix notation $A\backslash c$
+has the empty string.\footnote{We use the infix notation $A\backslash c$
 	instead of $\Der \; c \; A$ for brevity, as it is clear we are operating
-on languages rather than regular expressions }.
-In general, if we have a language $S_{start}$,
-then we can test whether $s$ is in $S_{start}$
+on languages rather than regular expressions.}
+
+In general, if we have a language $S$,
+then we can test whether $s$ is in $S$
 by testing whether $[] \in S \backslash s$.
-
 With the sequencing, Kleene star, and $\textit{Der}$ operator on languages,
 we have a  few properties of how the language derivative can be defined using 
 sub-languages.
 For example, for the sequence operator, we have
-something similar to the ``chain rule'' of the calculus derivative:
+something similar to a ``chain rule'':
 \begin{lemma}
 \[
 	\Der \; c \; (A @ B) =
@@ -178,8 +178,10 @@
 \end{proof}
 
 \noindent
-Before we define the $\textit{Der}$ and $\textit{Ders}$ counterpart
-for regular languages, we need to first give definitions for regular expressions.
+The clever idea of Brzozowski was to find counterparts of $\Der$ and $\Ders$
+for regular expressions.
+To introduce them, we need to first give definitions for regular expressions,
+which we shall do next.
 
 \subsection{Regular Expressions and Their Meaning}
 The \emph{basic regular expressions} are defined inductively
@@ -196,10 +198,10 @@
 and bounded repetitions.
 We use $\ZERO$ for the regular expression that
 matches no string, and $\ONE$ for the regular
-expression that matches only the empty string\footnote{
-some authors
+expression that matches only the empty string.\footnote{
+Some authors
 also use $\phi$ and $\epsilon$ for $\ZERO$ and $\ONE$
-but we prefer our notation}. 
+but we prefer our notation.} 
 The sequence regular expression is written $r_1\cdot r_2$
 and sometimes we omit the dot if it is clear which
 regular expression is meant; the alternative
@@ -213,72 +215,75 @@
 $L \; \ZERO$ & $\dn$ & $\phi$\\
 $L \; \ONE$ & $\dn$ & $\{[]\}$\\
 $L \; c$ & $\dn$ & $\{[c]\}$\\
-$L \; r_1 + r_2$ & $\dn$ & $ L \; r_1 \cup L \; r_2$\\
-$L \; r_1 \cdot r_2$ & $\dn$ & $ L \; r_1 @ L \; r_2$\\
-$L \; r^*$ & $\dn$ & $ (L\;r)*$
+$L \; (r_1 + r_2)$ & $\dn$ & $ L \; r_1 \cup L \; r_2$\\
+$L \; (r_1 \cdot r_2)$ & $\dn$ & $ L \; r_1 @ L \; r_2$\\
+$L \; (r^*)$ & $\dn$ & $ (L\;r)*$
 \end{tabular}
 \end{center}
 \noindent
-Now with language derivatives of a language and regular expressions and
-their language interpretations in place, we are ready to define derivatives on regular expressions.
+%Now with language derivatives of a language and regular expressions and
+%their language interpretations in place, we are ready to define derivatives on regular expressions.
+With $L$ we are ready to introduce Brzozowski derivatives on regular expressions.
+We do so by first introducing what properties it should satisfy.
+
 \subsection{Brzozowski Derivatives and a Regular Expression Matcher}
 %Recall, the language derivative acts on a set of strings
 %and essentially chops off a particular character from
 %all strings in that set, Brzozowski defined a derivative operation on regular expressions
 %so that after derivative $L(r\backslash c)$ 
 %will look as if it was obtained by doing a language derivative on $L(r)$:
-Recall that the language derivative acts on a 
-language (set of strings).
-One can decide whether a string $s$ belongs
-to a language $S$ by taking derivative with respect to
-that string and then checking whether the empty 
-string is in the derivative:
-\begin{center}
-\parskip \baselineskip
-\def\myupbracefill#1{\rotatebox{90}{\stretchto{\{}{#1}}}
-\def\rlwd{.5pt}
-\newcommand\notate[3]{%
-  \unskip\def\useanchorwidth{T}%
-  \setbox0=\hbox{#1}%
-  \def\stackalignment{c}\stackunder[-6pt]{%
-    \def\stackalignment{c}\stackunder[-1.5pt]{%
-      \stackunder[-2pt]{\strut #1}{\myupbracefill{\wd0}}}{%
-    \rule{\rlwd}{#2\baselineskip}}}{%
-  \strut\kern7pt$\hookrightarrow$\rlap{ \footnotesize#3}}\ignorespaces%
-}
-\Longstack{
-\notate{$\{ \ldots ,\;$ 
-	\notate{s}{1}{$(c_1 :: s_1)$} 
-	$, \; \ldots \}$ 
-}{1}{$S_{start}$} 
-}
-\Longstack{
-	$\stackrel{\backslash c_1}{\longrightarrow}$
-}
-\Longstack{
-	$\{ \ldots,\;$  \notate{$s_1$}{1}{$(c_2::s_2)$} 
-	$,\; \ldots \}$
-}
-\Longstack{
-	$\stackrel{\backslash c_2}{\longrightarrow}$ 
-}
-\Longstack{
-	$\{ \ldots,\;  s_2
-	,\; \ldots \}$
-}
-\Longstack{
-	$ \xdashrightarrow{\backslash c_3\ldots\ldots} $	
-}
-\Longstack{
-	\notate{$\{\ldots, [], \ldots\}$}{1}{$S_{end} = 
-	S_{start}\backslash s$}
-}
-\end{center}
-\begin{center}
-	$s \in S_{start} \iff [] \in S_{end}$
-\end{center}
-\noindent
-Brzozowski noticed that this operation
+%Recall that the language derivative acts on a 
+%language (set of strings).
+%One can decide whether a string $s$ belongs
+%to a language $S$ by taking derivative with respect to
+%that string and then checking whether the empty 
+%string is in the derivative:
+%\begin{center}
+%\parskip \baselineskip
+%\def\myupbracefill#1{\rotatebox{90}{\stretchto{\{}{#1}}}
+%\def\rlwd{.5pt}
+%\newcommand\notate[3]{%
+%  \unskip\def\useanchorwidth{T}%
+%  \setbox0=\hbox{#1}%
+%  \def\stackalignment{c}\stackunder[-6pt]{%
+%    \def\stackalignment{c}\stackunder[-1.5pt]{%
+%      \stackunder[-2pt]{\strut #1}{\myupbracefill{\wd0}}}{%
+%    \rule{\rlwd}{#2\baselineskip}}}{%
+%  \strut\kern7pt$\hookrightarrow$\rlap{ \footnotesize#3}}\ignorespaces%
+%}
+%\Longstack{
+%\notate{$\{ \ldots ,\;$ 
+%	\notate{s}{1}{$(c_1 :: s_1)$} 
+%	$, \; \ldots \}$ 
+%}{1}{$S_{start}$} 
+%}
+%\Longstack{
+%	$\stackrel{\backslash c_1}{\longrightarrow}$
+%}
+%\Longstack{
+%	$\{ \ldots,\;$  \notate{$s_1$}{1}{$(c_2::s_2)$} 
+%	$,\; \ldots \}$
+%}
+%\Longstack{
+%	$\stackrel{\backslash c_2}{\longrightarrow}$ 
+%}
+%\Longstack{
+%	$\{ \ldots,\;  s_2
+%	,\; \ldots \}$
+%}
+%\Longstack{
+%	$ \xdashrightarrow{\backslash c_3\ldots\ldots} $	
+%}
+%\Longstack{
+%	\notate{$\{\ldots, [], \ldots\}$}{1}{$S_{end} = 
+%	S_{start}\backslash s$}
+%}
+%\end{center}
+%\begin{center}
+%	$s \in S_{start} \iff [] \in S_{end}$
+%\end{center}
+%\noindent
+Brzozowski noticed that $\Der$
 can be ``mirrored'' on regular expressions which
 he calls the derivative of a regular expression $r$
 with respect to a character $c$, written
@@ -298,7 +303,8 @@
 \]
 \end{property}
 \noindent
-Pictorially, this looks as follows:
+Pictorially, this looks as follows:\\
+\vspace{3mm}
 
 \parskip \baselineskip
 \def\myupbracefill#1{\rotatebox{90}{\stretchto{\{}{#1}}}
@@ -322,16 +328,17 @@
 \Longstack{
 	\notate{$r\backslash c$}{1}{$L \; (r\backslash c)=
 	\{\ldots,\;s_1,\;\ldots\}$}
-}
-\\
+}\\ 
+\vspace{ 3mm }
+
 The derivatives on regular expression can again be 
 generalised to a string.
-One could compute $r\backslash s$  and test membership of $s$
-in $L \; r$ by checking
+One could compute $r_{start} \backslash s$  and test membership of $s$
+in $L \; r_{start}$ by checking
 whether the empty string is in the language of 
-$r\backslash s$.
+$r_{end}$ ($r_{start}\backslash s$).\\
 
-
+\vspace{2mm}
 \Longstack{
 	\notate{$r_{start}$}{4}{
 		\Longstack{$L \; r_{start} = \{\ldots, \;$ 
@@ -373,7 +380,7 @@
 	$s \in L \; r_{start} \iff [] \in L \; r_{end}$
 \end{property}
 \noindent
-Now we give the recursive definition of derivative on
+Next we give the recursive definition of derivative on
 regular expressions, so that it satisfies the properties above.
 The derivative function, written $r\backslash c$, 
 takes a regular expression $r$ and character $c$, and
@@ -485,13 +492,9 @@
 	By induction on $r$.
 \end{proof}
 \noindent
-The main property of the derivative operation
-(that enables us to reason about the correctness of
-derivative-based matching)
-is 
-
-
-\noindent
+which is the main property of derivatives
+that enables us to reason about the correctness of
+derivative-based matching.
 We can generalise the derivative operation shown above for single characters
 to strings as follows:
 
@@ -507,7 +510,8 @@
 omit the subscript and use $\backslash$ instead
 of $\backslash_s$ to denote
 string derivatives for brevity.
-Brzozowski's  regular-expression matching algorithm can then be described as:
+Brzozowski's derivative-based
+regular-expression matching algorithm can then be described as:
 
 \begin{definition}
 $\textit{match}\;s\;r \;\dn\; \nullable \; (r\backslash s)$
@@ -526,8 +530,8 @@
 \end{equation}
 
 \noindent
- It can  be
-relatively  easily shown that this matcher is correct:
+ It is relatively
+easy to show that this matcher is correct, namely
 \begin{lemma}
 	$\textit{match} \; s\; r  = \textit{true} \; \textit{iff} \; s \in L(r)$
 \end{lemma}
@@ -535,7 +539,6 @@
 	By induction on $s$ using property of derivatives:
 	lemma \ref{derDer}.
 \end{proof}
-\noindent
 \begin{figure}
 \begin{center}
 \begin{tikzpicture}
@@ -560,15 +563,16 @@
 \ref{NaiveMatcher}.
 Note that both axes are in logarithmic scale.
 Around two dozens characters
-would already explode the matcher on the regular expression 
+would already ``explode'' the matcher with the regular expression 
 $(a^*)^*b$.
-Too improve this situation, we need to introduce simplification
-rewrite rules for the intermediate results,
+To improve this situation, we need to introduce simplification
+rules for the intermediate results,
 such as $r + r \rightarrow r$,
 and make sure those rules do not change the 
 language of the regular expression.
 One simpled-minded simplification function
-that achieves these requirements is given below:
+that achieves these requirements 
+is given below (see Ausaf et al. \cite{AusafDyckhoffUrban2016}):
 \begin{center}
 	\begin{tabular}{lcl}
 		$\simp \; r_1 \cdot r_2 $ & $ \dn$ & 
@@ -620,13 +624,15 @@
 is now ``tame''  in terms of the length of inputs,
 as shown in Figure \ref{BetterMatcher}.
 
-So far the story is use Brzozowski derivatives,
-simplifying where possible and at the end test
-whether the empty string is recognised by the final derivative.
+So far the story is use Brzozowski derivatives and
+simplify as much as possible and at the end test
+whether the empty string is recognised 
+by the final derivative.
 But what if we want to 
 do lexing instead of just getting a true/false answer?
-Sulzmanna and Lu \cite{Sulzmann2014} first came up with a nice and 
-elegant (arguably as beautiful as the definition of the original derivative) solution for this.
+Sulzmann and Lu \cite{Sulzmann2014} first came up with a nice and 
+elegant (arguably as beautiful as the definition of the 
+Brzozowski derivative) solution for this.
 
 \section{Values and the Lexing Algorithm by Sulzmann and Lu}
 In this section, we present a two-phase regular expression lexing 
@@ -686,7 +692,8 @@
 \end{center}
 Sulzmann and Lu used a binary predicate, written $\vdash v:r $,
 to indicate that a value $v$ could be generated from a lexing algorithm
-with input $r$. They call it the value inhabitation relation. 
+with input $r$. They call it the value inhabitation relation,
+defined by the rules.
 \begin{mathpar}
 	\inferrule{\mbox{}}{\vdash \Char \; c : \mathbf{c}} \hspace{2em}
 
@@ -707,23 +714,23 @@
 satisfying $|v| = s$ and $\vdash v:r$ is finite.
 This additional condition was
 imposed by Ausaf and Urban to make their proofs easier.
-Given the same string and regular expression, there can be
+Given a string and a regular expression, there can be
 multiple values for it. For example, both
 $\vdash \Seq(\Left \; ab)(\Right \; c):(ab+a)(bc+c)$ and
 $\vdash \Seq(\Right\; a)(\Left \; bc ):(ab+a)(bc+c)$ hold
 and the values both flatten to $abc$.
 Lexers therefore have to disambiguate and choose only
-one of the values to output. $\POSIX$ is one of the
+one of the values be generated. $\POSIX$ is one of the
 disambiguation strategies that is widely adopted.
 
-Ausaf and Urban\parencite{AusafDyckhoffUrban2016} 
+Ausaf et al.\parencite{AusafDyckhoffUrban2016} 
 formalised the property 
 as a ternary relation.
 The $\POSIX$ value $v$ for a regular expression
 $r$ and string $s$, denoted as $(s, r) \rightarrow v$, can be specified 
-in the following set of rules\footnote{The names of the rules are used
-as they were originally given in \cite{AusafDyckhoffUrban2016} }:
-\begin{figure}[H]
+in the following rules\footnote{The names of the rules are used
+as they were originally given in \cite{AusafDyckhoffUrban2016}.}:
+\begin{figure}[p]
 \begin{mathpar}
 	\inferrule[P1]{\mbox{}}{([], \ONE) \rightarrow \Empty}
 		
@@ -745,58 +752,64 @@
 		s_1@s_3 \in L \; r \land s_4 \in L \; r^*}{(s_1@s_2, r^*)\rightarrow \Stars \;
 	(v::vs)}
 \end{mathpar}
-\caption{The inductive POSIX Lexing Rules defined by Ausaf, Dyckhoff and Urban \cite{AusafDyckhoffUrban2016}.
-The ternary relation, written $(s, r) \rightarrow v$, formalises the POSIX constraints on the
+\caption{The inductive POSIX rules given by Ausaf et al.
+	\cite{AusafDyckhoffUrban2016}.
+This ternary relation, written $(s, r) \rightarrow v$, 
+formalises the POSIX constraints on the
 value $v$ given a string $s$ and 
 regular expression $r$.
-For example, this specification says that all matches for an alternative 
-must always prefer a left value to a right one.
 }
-\end{figure}
+\end{figure}\afterpage{\clearpage}
 \noindent
 
-\begin{figure}
-\begin{tikzpicture}[]
-    \node [minimum width = 6cm, rectangle split, rectangle split horizontal, 
-	    rectangle split parts=2, rectangle split part fill={red!30,blue!20}, style={draw, rounded corners, inner sep=10pt}]
-	    (node1)
-	    {$r_{token1}$
-	    \nodepart{two}  $\;\;\; \quad r_{token2}\;\;\;\quad$ };
-	    %\node [left = 6.0cm of node1] (start1) {hi};
-	    \node [left = 0.2cm of node1] (middle) {$v.s.$};
-    \node [minimum width = 6cm, left = 0.2cm of middle, rectangle split, rectangle split horizontal, 
-	    rectangle split parts=2, rectangle split part fill={red!30,blue!20}, style={draw, rounded corners, inner sep=10pt}]
-	    (node2)
-	    {$\quad\;\;\;r_{token1}\quad\;\;\;$
-	    \nodepart{two}  $r_{token2}$ };
-	    \node [below = 0.1cm of node2] (text1) {\checkmark preferred by POSIX};
-		\node [above = 1.5cm of middle, minimum width = 6cm, 
-			rectangle, style={draw, rounded corners, inner sep=10pt}] 
-			(topNode) {$s$};
-	    \path[->,draw]
-	        (topNode) edge node {split $A$} (node2)
-	        (topNode) edge node {split $B$} (node1)
-		;
-			
-
-\end{tikzpicture}
-\caption{Maximum munch example: $s$ matches $r_{token1} \cdot r_{token2}$}\label{munch}
-\end{figure}
+%\begin{figure}
+%\begin{tikzpicture}[]
+%    \node [minimum width = 6cm, rectangle split, rectangle split horizontal, 
+%	    rectangle split parts=2, rectangle split part fill={red!30,blue!20}, style={draw, rounded corners, inner sep=10pt}]
+%	    (node1)
+%	    {$r_{token1}$
+%	    \nodepart{two}  $\;\;\; \quad r_{token2}\;\;\;\quad$ };
+%	    %\node [left = 6.0cm of node1] (start1) {hi};
+%	    \node [left = 0.2cm of node1] (middle) {$v.s.$};
+%    \node [minimum width = 6cm, left = 0.2cm of middle, rectangle split, rectangle split horizontal, 
+%	    rectangle split parts=2, rectangle split part fill={red!30,blue!20}, style={draw, rounded corners, inner sep=10pt}]
+%	    (node2)
+%	    {$\quad\;\;\;r_{token1}\quad\;\;\;$
+%	    \nodepart{two}  $r_{token2}$ };
+%	    \node [below = 0.1cm of node2] (text1) {\checkmark preferred by POSIX};
+%		\node [above = 1.5cm of middle, minimum width = 6cm, 
+%			rectangle, style={draw, rounded corners, inner sep=10pt}] 
+%			(topNode) {$s$};
+%	    \path[->,draw]
+%	        (topNode) edge node {split $A$} (node2)
+%	        (topNode) edge node {split $B$} (node1)
+%		;
+%			
+%
+%\end{tikzpicture}
+%\caption{Maximum munch example: $s$ matches $r_{token1} \cdot r_{token2}$}\label{munch}
+%\end{figure}
 The above $\POSIX$ rules follows the intuition described below: 
 \begin{itemize}
 	\item (Left Priority)\\
 		Match the leftmost regular expression when multiple options of matching
-		are available.
+		are available. See P+L and P+R where in P+R $s$ cannot
+		be in the language of $L \; r_1$.
 	\item (Maximum munch)\\
 		Always match a subpart as much as possible before proceeding
-		to the next token.
+		to the next part of the string.
 		For example, when the string $s$ matches 
-		$r_{token1}\cdot r_{token2}$, and we have two ways $s$ can be split:
-		Then the split that matches a longer string for the first token
-		$r_{token1}$ is preferred by this maximum munch rule (See
-		\ref{munch} for an illustration).
-\noindent
-
+		$r_{part1}\cdot r_{part2}$, and we have two ways $s$ can be split:
+		Then the split that matches a longer string for the first part
+		$r_{part1}$ is preferred by this maximum munch rule.
+		This is caused by the side-condition 
+		\begin{center}
+		$\nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land 
+		s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2$
+		\end{center}
+		in PS.
+		%(See
+		%\ref{munch} for an illustration).
 \end{itemize}
 \noindent
 These disambiguation strategies can be 
@@ -805,28 +818,31 @@
 \[ 
 	\textit{iffoo} = 3
 \]
-using the regular expression (with 
-keyword and identifier having their 
-usualy definitions on any formal
-language textbook, for instance
-keyword is a nonempty string starting with letters 
-followed by alphanumeric characters or underscores):
+using a regular expression 
+for keywords and 
+identifiers:
+%(for example, keyword is a nonempty string starting with letters 
+%followed by alphanumeric characters or underscores):
 \[
-	\textit{keyword} + \textit{identifier},
+	r_{keyword} + r_{identifier}.
 \]
-we want $\textit{iffoo}$ to be recognized
-as an identifier rather than a keyword (if)
+If we want $\textit{iffoo}$ to be recognized
+as an identifier
+where identifiers are defined as usual (letters
+followed by letters, numbers or underscores),
+then a match with a keyword (if)
 followed by
-an identifier (foo).
-POSIX lexing achieves this.
+an identifier (foo) would be incorrect.
+POSIX lexing would generate what we want.
 
-We know that a $\POSIX$ value is also a normal underlying
-value:
+\noindent
+We know that a POSIX 
+value for regular expression $r$ is inhabited by $r$.
 \begin{lemma}
 $(r, s) \rightarrow v \implies \vdash v: r$
 \end{lemma}
 \noindent
-The good property about a $\POSIX$ value is that 
+The main property about a $\POSIX$ value is that 
 given the same regular expression $r$ and string $s$,
 one can always uniquely determine the $\POSIX$ value for it:
 \begin{lemma}
@@ -877,20 +893,20 @@
 is the same as $\Seq(v_1, v_2)$. 
 \end{proof}
 \noindent
-Now we know what a $\POSIX$ value is and why it is unique;
-the problem is generating 
+We have now defined what a POSIX value is and shown that it is unique,
+the problem is to generate
 such a value in a lexing algorithm using derivatives.
 
 \subsection{Sulzmann and Lu's Injection-based Lexing Algorithm}
 
 Sulzmann and Lu extended Brzozowski's 
 derivative-based matching
-to a lexing algorithm by a second pass
+to a lexing algorithm by a second phase 
 after the initial phase of successive derivatives.
 This second phase generates a POSIX value 
 if the regular expression matches the string.
 Two functions are involved: $\inj$ and $\mkeps$.
-The first one used is $\mkeps$, which constructs a POSIX value from the last
+The function $\mkeps$ constructs a POSIX value from the last
 derivative $r_n$:
 \begin{ceqn}
 \begin{equation}\label{graph:mkeps}
@@ -903,14 +919,12 @@
 \noindent
 In the above diagram, again we assume that
 the input string $s$ is made of $n$ characters
-$c_0c_1 \ldots c_{n-1}$, and the input regular expression $r$
-is given label $0$ and after each character $c_i$ is taken off
-by the derivative operation the resulting derivative regular 
-expressioin is $r_{i+1}$.The last derivative operation 
-$\backslash c_{n-1}$ gives back $r_n$, which is transformed into 
-a value $v_n$ by $\mkeps$.
-$v_n$ tells us how an empty string is matched by the (nullable)
-regular expression $r_n$, in a $\POSIX$ way.
+$c_0c_1 \ldots c_{n-1}$ 
+The last derivative operation 
+$\backslash c_{n-1}$ generates the derivative $r_n$, for which
+$\mkeps$ produces the value $v_n$. This value
+tells us how the empty string is matched by the (nullable)
+regular expression $r_n$, in a POSIX way.
 The definition of $\mkeps$ is
 	\begin{center}
 		\begin{tabular}{lcl}
@@ -926,26 +940,24 @@
 
 
 \noindent 
-We favour the left child $r_1$ of $r_1 + r_2$ 
+The function prefers the left child $r_1$ of $r_1 + r_2$ 
 to match an empty string if there is a choice.
-When there is a star for us to match the empty string,
+When there is a star to match the empty string,
 we give the $\Stars$ constructor an empty list, meaning
 no iteration is taken.
-The result of a call to $\mkeps$ on a $\nullable$ $r$ would
-be a $\POSIX$ value corresponding to $r$:
+The result of $\mkeps$ on a $\nullable$ $r$ 
+is a POSIX value for $r$ and the empty string:
 \begin{lemma}\label{mePosix}
 $\nullable\; r \implies (r, []) \rightarrow (\mkeps\; v)$
 \end{lemma}
 \begin{proof}
-	By induction on the shape of $r$.
+	By induction on $r$.
 \end{proof}
 \noindent
-After the $\mkeps$-call, we inject back the characters one by one
+After the $\mkeps$-call, Sulzmann and Lu inject back the characters one by one
 in reverse order as they were chopped off in the derivative phase.
-The fucntion for this is called $\inj$. $\inj$ and $\backslash$
-are not exactly reverse operations of one another, as $\inj$ 
-operates on values instead of regular
-expressions.
+The fucntion for this is called $\inj$. This function 
+operates on values, unlike $\backslash$ which operates on regular expressions.
 In the diagram below, $v_i$ stands for the (POSIX) value 
 for how the regular expression 
 $r_i$ matches the string $s_i$ consisting of the last $n-i$ characters

--- a/ChengsongTanPhdThesis/Chapters/Introduction.tex	Tue Nov 08 23:24:54 2022 +0000
+++ b/ChengsongTanPhdThesis/Chapters/Introduction.tex	Fri Nov 11 00:23:53 2022 +0000
@@ -850,28 +850,32 @@
 a (non-negligible) chance they might backtrack super-linearly,
 as shown in the graphs in figure\ref{fig:aStarStarb}.
 
-\subsection{Summary of the Catastrophic Backtracking Problem}
 Summing these up, we can categorise existing 
 practical regex libraries into two kinds:
-(i)The ones  with  linear
-time guarantees like Go and Rust. The cost with them is that
+(i) The ones  with  linear
+time guarantees like Go and Rust. The downside with them is that
 they impose restrictions
-on the user input (not allowing back-references, 
-bounded repetitions cannot exceed a counter limit etc.).
+on the regular expressions (not allowing back-references, 
+bounded repetitions cannot exceed an ad hoc limit etc.).
 (ii) Those 
 that allow large bounded regular expressions and back-references
-at the expense of using a backtracking algorithm.
-They could grind to a halt
-on some very simple cases, posing a vulnerability of
-a ReDoS attack.
+at the expense of using backtracking algorithms.
+They can potentially ``grind to a halt''
+on some very simple cases, resulting 
+ReDoS attacks.
 
- 
-We would like to have regex engines that can 
-deal with the regular part (e.g.
-bounded repetitions) of regexes more
-efficiently.
-Also we want to make sure that they do it correctly.
-It turns out that such aim is not so easy to achieve.
+The proble with both approaches is the motivation for us 
+to look again at the regular expression matching problem. 
+Another motivation is that regular expression matching algorithms
+that follow the POSIX standard often contain errors and bugs 
+as we shall explain next.
+
+%We would like to have regex engines that can 
+%deal with the regular part (e.g.
+%bounded repetitions) of regexes more
+%efficiently.
+%Also we want to make sure that they do it correctly.
+%It turns out that such aim is not so easy to achieve.
  %TODO: give examples such as RE2 GOLANG 1000 restriction, rust no repetitions 
 % For example, the Rust regex engine claims to be linear, 
 % but does not support lookarounds and back-references.
@@ -894,54 +898,56 @@
 
 
 \section{Error-prone POSIX Implementations}
-When there are multiple ways of matching a string
-with a regular expression, a matcher needs to
+Very often there are multiple ways of matching a string
+with a regular expression.
+In such cases the regular expressions matcher needs to
 disambiguate.
-The standard for which particular match to pick
-is called the disambiguation strategy.
-The more intuitive strategy is called POSIX,
-which always chooses the longest initial match.
-An alternative strategy would be greedy matches,
-which always ends a sub-match as early as possible.
-The POSIX standard is widely adopted in many operating systems.
+The more widely used strategy is called POSIX,
+which roughly speaking always chooses the longest initial match.
+The POSIX strategy is widely adopted in many regular expression matchers.
 However, many implementations (including the C libraries
 used by Linux and OS X distributions) contain bugs
 or do not meet the specification they claim to adhere to.
-In some cases, they either fail to generate a lexing 
+Kuklewicz maintains a unit test repository which lists some
+problems with existing regular expression engines \ref{KuklewiczHaskell}.
+In some cases, they either fail to generate a
 result when there exists a match,
-or give results that are inconsistent with the $\POSIX$ standard.
-A concrete example would be the regex given by \cite{fowler2003}
+or give results that are inconsistent with the POSIX standard.
+A concrete example is the regex:
 \begin{center}
 	$(aba + ab + a)^* \text{and the string} ababa$
 \end{center}
-The correct $\POSIX$ match for the above would be 
-with the entire string $ababa$, 
-split into two Kleene star iterations, $[ab] [aba]$ at positions
+The correct POSIX match for the above
+is the entire string $ababa$, 
+split into two Kleene star iterations, namely $[ab], [aba]$ at positions
 $[0, 2), [2, 5)$
 respectively.
-But trying this out in regex101\parencite{regex101}
-with different language engines would yield 
-the same two fragmented matches: $[aba]$ at $[0, 3)$
+But trying this out in regex101 \parencite{regex101} \footnote{
+	regex101 is an online regular expression matcher which
+	provides API for trying out regular expression
+	engines of multiple popular programming languages like
+Java, Python, Go, etc.}
+with different engines yields
+always matches: $[aba]$ at $[0, 3)$
 and $a$ at $[4, 5)$.
 Fowler \cite{fowler2003} and Kuklewicz \cite{KuklewiczHaskell} 
 commented that most regex libraries are not
-correctly implementing the POSIX (maximum-munch)
-rule of regular expression matching.
-As Grathwohl\parencite{grathwohl2014crash} wrote,
+correctly implementing the central POSIX
+rule, called the maximum munch rule.
+Grathwohl \parencite{grathwohl2014crash} wrote,
 \begin{quote}
 	``The POSIX strategy is more complicated than the 
 	greedy because of the dependence on information about 
 	the length of matched strings in the various subexpressions.''
 \end{quote}
 %\noindent
-The implementation complexity of POSIX rules also come from
-the specification being not very clear.
+We think the implementation complexity of POSIX rules also come from
+the specification being not very precise.
 There are many informal summaries of this disambiguation
 strategy, which are often quite long and delicate.
 For example Kuklewicz \cite{KuklewiczHaskell} 
-described the POSIX rule as
+described the POSIX rule as (section 1, last paragraph):
 \begin{quote}
-	``
 	\begin{itemize}
 		\item
 regular expressions (REs) take the leftmost starting match, and the longest match starting there
@@ -958,88 +964,84 @@
 \item
 if "p" and "q" can never match the same text then "p|q" and "q|p" are equivalent, up to trivial renumbering of captured subexpressions\\
 \item
-if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string\\''
+if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string\\
 \end{itemize}
 \end{quote}
-The text above 
-is trying to capture something very precise,
-and is crying out for formalising.
+%The text above 
+%is trying to capture something very precise,
+%and is crying out for formalising.
 Ausaf et al. \cite{AusafDyckhoffUrban2016}
-are the first to fill the gap
-by not just describing such a formalised POSIX
-specification in Isabelle/HOL, but also proving
+are the first to
+give a quite simple formalised POSIX
+specification in Isabelle/HOL, and also prove 
 that their specification coincides with the 
-POSIX specification given by Okui and Suzuki \cite{Okui10} 
-which is a completely
-different characterisation.
+POSIX specification given by Okui and Suzuki \cite{Okui10}.
 They then formally proved the correctness of
 a lexing algorithm by Sulzmann and Lu \cite{Sulzmann2014}
-based on that specification.
+with regards to that specification.
+They also found that the informal POSIX
+specification by Sulzmann and Lu does not work for the correctness proof.
 
-In the next section we will very briefly
+In the next section we will briefly
 introduce Brzozowski derivatives and Sulzmann
-and Lu's algorithm, which this thesis builds on.
-We give a taste of what they 
-are like and why they are suitable for regular expression
-matching and lexing.
- 
-\section{Our Solution--Formal Specification of POSIX Matching 
+and Lu's algorithm, which the main point of this thesis builds on.
+%We give a taste of what they 
+%are like and why they are suitable for regular expression
+%matching and lexing.
+\section{Formal Specification of POSIX Matching 
 and Brzozowski Derivatives}
-Now we start with the central topic of the thesis: Brzozowski derivatives.
+%Now we start with the central topic of the thesis: Brzozowski derivatives.
 Brzozowski \cite{Brzozowski1964} first introduced the 
-concept of the \emph{derivative} in the 1960s.
+concept of a \emph{derivative} of regular expression in 1964.
 The derivative of a regular expression $r$
-with respect to a character $c$, is written as $r \backslash c$.\footnote{
-	Despite having the same name, regular expression
-	derivatives bear little similarity with the mathematical definition
-	of derivatives on functions.
-}
-It tells us what $r$ would transform into
-if we chop off the first character $c$ 
-from all strings in the language of $r$ ($L \; r$).
-To give a flavour of Brzozowski derivatives, we present
-two straightforward clauses from it:
-\begin{center}
-	\begin{tabular}{lcl}
-		$d \backslash c$     & $\dn$ & 
-		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
-$(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
-	\end{tabular}
-\end{center}
-\noindent
-The first clause says that for the regular expression
-denoting a singleton set consisting of a sinlge-character string $\{ d \}$,
-we check the derivative character $c$ against $d$,
-returning a set containing only the empty string $\{ [] \}$
-if $c$ and $d$ are equal, and the empty set $\varnothing$ otherwise.
-The second clause states that to obtain the regular expression
-representing all strings' head character $c$ being chopped off
-from $r_1 + r_2$, one simply needs to recursively take derivative
-of $r_1$ and $r_2$ and then put them together.
-
-Thanks to the definition, derivatives have the nice property
+with respect to a character $c$, is written as $r \backslash c$.
+This operation tells us what $r$ transforms into
+if we ``chop'' off the first character $c$ 
+from all strings in the language of $r$ (defined
+later as $L \; r$).
+%To give a flavour of Brzozowski derivatives, we present
+%two straightforward clauses from it:
+%\begin{center}
+%	\begin{tabular}{lcl}
+%		$d \backslash c$     & $\dn$ & 
+%		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
+%$(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
+%	\end{tabular}
+%\end{center}
+%\noindent
+%The first clause says that for the regular expression
+%denoting a singleton set consisting of a sinlge-character string $\{ d \}$,
+%we check the derivative character $c$ against $d$,
+%returning a set containing only the empty string $\{ [] \}$
+%if $c$ and $d$ are equal, and the empty set $\varnothing$ otherwise.
+%The second clause states that to obtain the regular expression
+%representing all strings' head character $c$ being chopped off
+%from $r_1 + r_2$, one simply needs to recursively take derivative
+%of $r_1$ and $r_2$ and then put them together.
+Derivatives have the property
 that $s \in L \; (r\backslash c)$ if and only if 
-$c::s \in L \; r$.
+$c::s \in L \; r$ where $::$ stands for list prepending.
 %This property can be used on regular expressions
 %matching and lexing--to test whether a string $s$ is in $L \; r$,
 %one simply takes derivatives of $r$ successively with
 %respect to the characters (in the correct order) in $s$,
 %and then test whether the empty string is in the last regular expression.
-Derivatives give a simple solution
-to the problem of matching and lexing a string $s$ with a regular
+With this derivatives give a simple solution
+to the problem of matching a string $s$ with a regular
 expression $r$: if the derivative of $r$ w.r.t.\ (in
 succession) all the characters of the string matches the empty string,
 then $r$ matches $s$ (and {\em vice versa}).  
-
-This makes formally reasoning about these properties such
-as correctness and complexity smooth and intuitive.
-In fact, there has already been several mechanised proofs about them,
-for example the one by Owens and Slind \cite{Owens2008} in HOL4,
+%This makes formally reasoning about these properties such
+%as correctness and complexity smooth and intuitive.
+There had been several mechanised proofs about this property in various theorem
+provers,
+for example one by Owens and Slind \cite{Owens2008} in HOL4,
 another one by Krauss and Nipkow \cite{Nipkow98} in Isabelle/HOL, and
 yet another in Coq by Coquand and Siles \cite{Coquand2012}.
 
-In addition, one can extend the clauses to bounded repetitions
-``for free'':
+In addition, one can extend derivatives to bounded repetitions
+relatively straightforwardly. For example, the derivative for 
+this can be defined as:
 \begin{center}
 	\begin{tabular}{lcl}
 		$r^{\{n\}} \backslash c$     & $\dn$ & $r \backslash c \cdot
@@ -1047,19 +1049,20 @@
 	\end{tabular}
 \end{center}
 \noindent
-And experimental results suggest that  unlike DFA-based solutions,
-this derivatives can support 
-bounded regular expressions with large counters
+Experimental results suggest that  unlike DFA-based solutions
+for bounded regular expressions,
+derivatives can cope
+large counters
 quite well.
 
 There has also been 
 extensions to other constructs.
 For example, Owens et al include the derivatives
-for \emph{NOT} regular expressions, which is
+for the \emph{NOT} regular expression, which is
 able to concisely express C-style comments of the form
-$/* \ldots */$.
-Another extension for derivatives would be
-regular expressions with look-aheads, done by
+$/* \ldots */$ (see \cite{Owens2008}).
+Another extension for derivatives is
+regular expressions with look-aheads, done
 by Miyazaki and Minamide
 \cite{Takayuki2019}.
 %We therefore use Brzozowski derivatives on regular expressions 
@@ -1079,7 +1082,7 @@
 \cite{Almeidaetal10}, \cite{Berglund14}, \cite{Berglund18},
 \cite{Chen12} and \cite{Coquand2012}
 to name a few), despite being buried in the ``sands of time'' \cite{Owens2008}
-after they were first published.
+after they were first published by Brzozowski.
 
 
 However, there are two difficulties with derivative-based matchers:
@@ -1100,35 +1103,37 @@
 is made for the Verbatim \cite{Verbatim}
 %TODO: give references
 lexer, which calculates POSIX matches and is based on derivatives.
-They formalized the correctness of the lexer, but not the complexity.
-In the performance evaluation section, they simply analyzed the run time
+They formalized the correctness of the lexer, but not their complexity result.
+In the performance evaluation section, they analyzed the run time
 of matching $a$ with the string 
 \begin{center}
-	$\underbrace{a \ldots a}_{\text{n a's}}$
+	$\underbrace{a \ldots a}_{\text{n a's}}$.
 \end{center}
-and concluded that the algorithm is quadratic in terms of input length.
+\noindent
+They concluded that the algorithm is quadratic in terms of 
+the length of the input string.
 When we tried out their extracted OCaml code with our example $(a+aa)^*$,
-the time it took to lex only 40 $a$'s was 5 minutes.
+the time it took to match a string of 40 $a$'s was approximately 5 minutes.
 
 
 \subsection{Sulzmann and Lu's Algorithm}
 Sulzmann and Lu~\cite{Sulzmann2014} overcame the first 
-difficulty by cleverly extending Brzozowski's matching
+problem with the yes/no answer 
+by cleverly extending Brzozowski's matching
 algorithm. Their extended version generates additional information on
 \emph{how} a regular expression matches a string following the POSIX
 rules for regular expression matching. They achieve this by adding a
 second ``phase'' to Brzozowski's algorithm involving an injection
-function simplification of internal data structures 
-eliminating the exponential behaviours.
-In an earlier work, Ausaf et al provided the formal
+function.
+In earlier work, Ausaf et al provided the formal
 specification of what POSIX matching means and proved in Isabelle/HOL
 the correctness
-of Sulzmann and Lu's extended algorithm accordingly
+of this extended algorithm accordingly
 \cite{AusafDyckhoffUrban2016}.
 
 The version of the algorithm proven correct 
-suffers from the
-second difficulty though, where the internal derivatives can
+suffers heavily from a 
+second difficulty, where the internal derivatives can
 grow to arbitrarily big sizes. 
 For example if we start with the
 regular expression $(a+aa)^*$ and take
@@ -1147,23 +1152,22 @@
 \end{center}
  
 \noindent where after around 35 steps we run out of memory on a
-typical computer (we shall define in the next chapter 
-the precise details of our
-regular expressions and the derivative operation).  Clearly, the
+typical computer.  Clearly, the
 notation involving $\ZERO$s and $\ONE$s already suggests
 simplification rules that can be applied to regular regular
 expressions, for example $\ZERO{}\,r \Rightarrow \ZERO$, $\ONE{}\,r
 \Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow
-r$. While such simple-minded simplifications have been proved in our
-earlier work to preserve the correctness of Sulzmann and Lu's
+r$. While such simple-minded simplifications have been proved in 
+the work by Ausaf et al. to preserve the correctness of Sulzmann and Lu's
 algorithm \cite{AusafDyckhoffUrban2016}, they unfortunately do
 \emph{not} help with limiting the growth of the derivatives shown
 above: the growth is slowed, but the derivatives can still grow rather
 quickly beyond any finite bound.
 
-Sulzmann and Lu overcome this ``growth problem'' in a second algorithm
-\cite{Sulzmann2014} where they introduce bit-coded
-regular expressions. In this version, POSIX values are
+Therefore we want to look in this thesis at a second
+algorithm by Sulzmann and Lu where they
+overcame this ``growth problem'' \cite{Sulzmann2014}.
+In this version, POSIX values are 
 represented as bit sequences and such sequences are incrementally generated
 when derivatives are calculated. The compact representation
 of bit sequences and regular expressions allows them to define a more
@@ -1184,20 +1188,16 @@
   extensively [..] but yet
   have to work out all proof details.'' \cite[Page 14]{Sulzmann2014}
 \end{quote}  
-Ausaf and Urban were able to back this correctness claim with
-a formal proof.
-
-However a faster formally verified 
-lexing program with the optimisations
-mentioned by Sulzmann and Lu's second algorithm
-is still missing.
-As they stated,
+Ausaf and Urban made some initial progress towards the 
+full correctness proof but still had to leave out the optimisation
+Sulzmann and Lu proposed.
+Ausaf  wrote \cite{Ausaf},
   \begin{quote}\it
 ``The next step would be to implement a more aggressive simplification procedure on annotated regular expressions and then prove the corresponding algorithm generates the same values as blexer. Alas due to time constraints we are unable to do so here.''
 \end{quote}  
 This thesis implements the aggressive simplifications envisioned
 by Ausaf and Urban,
-together with a formal proof of the correctness with those simplifications.
+together with a formal proof of the correctness of those simplifications.
 
 
 One of the most recent work in the context of lexing
@@ -1239,56 +1239,36 @@
 and extension to
 the bounded repetitions construct with the correctness and finiteness property
 maintained.
- \end{itemize}
-
-
+\end{itemize}
+\noindent
 With a formal finiteness bound in place,
 we can greatly reduce the attack surface of servers in terms of ReDoS attacks.
 Further improvements to the algorithm with an even stronger version of 
-simplification is made.
-Thanks to our theorem-prover-friendly approach,
-we believe that 
-this finiteness bound can be improved to a bound
-linear to input and
-cubic to the regular expression size using a technique by
-Antimirov\cite{Antimirov95}.
-Once formalised, this would be a guarantee for the absence of all super-linear behavious.
-We are working out the
-details.
-
- 
-To our best knowledge, no lexing libraries using Brzozowski derivatives
-have similar complexity-related bounds, 
-and claims about running time are usually speculative and backed by empirical
-evidence on a few test cases.
-If a matching or lexing algorithm
-does not come with certain basic complexity related 
-guarantees (for examaple the internal data structure size
-does not grow indefinitely), 
-then they cannot claim with confidence having solved the problem
-of catastrophic backtracking.
+simplification can be made.
 
 
 
 
 
 \section{Structure of the thesis}
-In chapter 2 \ref{Inj} we will introduce the concepts
+In chapter \ref{Inj} we will introduce the concepts
 and notations we 
-use for describing the lexing algorithm by Sulzmann and Lu,
-and then give the lexing algorithm.
-We will give its variant in \ref{Bitcoded1}.
-Then we illustrate in \ref{Bitcoded2}
-how the algorithm without bitcodes falls short for such aggressive 
-simplifications and therefore introduce our version of the
- bit-coded algorithm and 
+use for describing regular expressions and derivatives,
+and the first version of their lexing algorithm without bitcodes (including 
+its correctness proof).
+We will give their second lexing algorithm with bitcodes in \ref{Bitcoded1}
+together with the correctness proof by Ausaf and Urban.
+Then we illustrate in chapter \ref{Bitcoded2}
+how Sulzmann and Lu's
+simplifications fail to simplify. We therefore introduce our version of the
+algorithm with simplification and 
 its correctness proof .  
-In \ref{Finite} we give the second guarantee
+In chapter \ref{Finite} we give the second guarantee
 of our bitcoded algorithm, that is a finite bound on the size of any 
-regex's derivatives.
-In \ref{Cubic} we discuss stronger simplifications to improve the finite bound
-in \ref{Finite} to a polynomial one, and demonstrate how one can extend the
-algorithm to include constructs such as bounded repetitions and negations.
+regular expression's derivatives.
+In chapter \ref{Cubic} we discuss stronger simplification rules which 
+improve the finite bound to a polynomial bound, and also show how one can extend the
+algorithm to include bounded repetitions. %and the NOT regular expression.

--- a/ChengsongTanPhdThesis/Chapters/RelatedWork.tex	Tue Nov 08 23:24:54 2022 +0000
+++ b/ChengsongTanPhdThesis/Chapters/RelatedWork.tex	Fri Nov 11 00:23:53 2022 +0000
@@ -49,3 +49,30 @@
 for regexes, attack string that exploit the worst-time 
 scenario, and "attack automata" that generates
 attack strings.
+
+
+
+
+
+
+Thanks to our theorem-prover-friendly approach,
+we believe that 
+this finiteness bound can be improved to a bound
+linear to input and
+cubic to the regular expression size using a technique by
+Antimirov\cite{Antimirov95}.
+Once formalised, this would be a guarantee for the absence of all super-linear behavious.
+We are working out the
+details.
+
+ 
+To our best knowledge, no lexing libraries using Brzozowski derivatives
+have similar complexity-related bounds, 
+and claims about running time are usually speculative and backed by empirical
+evidence on a few test cases.
+If a matching or lexing algorithm
+does not come with certain basic complexity related 
+guarantees (for examaple the internal data structure size
+does not grow indefinitely), 
+then they cannot claim with confidence having solved the problem
+of catastrophic backtracking.

--- a/ChengsongTanPhdThesis/example.bib	Tue Nov 08 23:24:54 2022 +0000
+++ b/ChengsongTanPhdThesis/example.bib	Fri Nov 11 00:23:53 2022 +0000
@@ -7,14 +7,6 @@
 %% Saved with string encoding Unicode (UTF-8) 
 
 
-%@phdthesis{Fahad18,
-%  author  = "Rempel, Robert Charles",
-%  title   = "Relaxation Effects for Coupled Nuclear Spins",
-%  school  = "Stanford University",
-%  address = "Stanford, CA",
-%  year    = 1956,
-%  month   = jun
-%}
 @article{Murugesan2014,
   author    = {N.~Murugesan and O.~V.~Shanmuga Sundaram},
   title     = {{S}ome {P}roperties of {B}rzozowski {D}erivatives of {R}egular {E}xpressions},
@@ -35,8 +27,8 @@
 
 %% POSIX specification------------------------
 @InProceedings{Okui10,
-author="Okui, Satoshi
-and Suzuki, Taro",
+author=" S.~Okui
+and T.~Suzuki",
 editor="Domaratzki, Michael
 and Salomaa, Kai",
 title="Disambiguation in Regular Expression Matching via Position Automata with Augmented Transitions",
@@ -53,7 +45,7 @@
 
 %% Brzozowski ders------------------------
 @article{Berglund14,
-author = {Berglund, Martin and Drewes, Frank and Van Der Merwe, Brink},
+author = {M.~Berglund, F.~Drewes and B.~Van Der Merwe},
 year = {2014},
 month = {05},
 pages = {},
@@ -64,7 +56,7 @@
 }
 
 @InProceedings{Berglund18,
-author="Berglund, Martin
+author="M.~Berglund
 and Bester, Willem
 and van der Merwe, Brink",
 editor="Fischer, Bernd
@@ -93,16 +85,16 @@
 
 
 %% Brzozowski ders------------------------
-@article{Murugesan2014,
-  author    = {N.~Murugesan and O.~V.~Shanmuga Sundaram},
-  title     = {{S}ome {P}roperties of {B}rzozowski {D}erivatives of {R}egular {E}xpressions},
-  journal   = {International Journal of Computer Trends and Technology},
-  volume    = {13},
-  number    = {1},
-  year      = {2014},
-  url       = {http://arxiv.org/abs/1407.5902},
-  pages     = {29--33}
-}
+%@article{Murugesan2014,
+%  author    = {N.~Murugesan and O.~V.~Shanmuga Sundaram},
+%  title     = {{S}ome {P}roperties of {B}rzozowski {D}erivatives of {R}egular {E}xpressions},
+%  journal   = {International Journal of Computer Trends and Technology},
+%  volume    = {13},
+%  number    = {1},
+%  year      = {2014},
+%  url       = {http://arxiv.org/abs/1407.5902},
+%  pages     = {29--33}
+%}
 
 %% look-aheads------------------------
 @article{Takayuki2019,

author	Chengsong
	Fri, 11 Nov 2022 00:23:53 +0000
changeset 622	4b1149fb5aec
parent 621	17c7611fb0a9
child 623	c0c1ebe09c7d

ChengsongTanPhdThesis/Chapters/Bitcoded1.tex		file \| annotate \| diff \| comparison \| revisions
ChengsongTanPhdThesis/Chapters/Inj.tex		file \| annotate \| diff \| comparison \| revisions
ChengsongTanPhdThesis/Chapters/Introduction.tex		file \| annotate \| diff \| comparison \| revisions
ChengsongTanPhdThesis/Chapters/RelatedWork.tex		file \| annotate \| diff \| comparison \| revisions
ChengsongTanPhdThesis/example.bib		file \| annotate \| diff \| comparison \| revisions