ChengsongTanPhdThesis/Chapters/Introduction.tex
changeset 622 4b1149fb5aec
parent 620 ae6010c14e49
child 628 7af4e2420a8c
--- a/ChengsongTanPhdThesis/Chapters/Introduction.tex	Tue Nov 08 23:24:54 2022 +0000
+++ b/ChengsongTanPhdThesis/Chapters/Introduction.tex	Fri Nov 11 00:23:53 2022 +0000
@@ -850,28 +850,32 @@
 a (non-negligible) chance they might backtrack super-linearly,
 as shown in the graphs in figure\ref{fig:aStarStarb}.
 
-\subsection{Summary of the Catastrophic Backtracking Problem}
 Summing these up, we can categorise existing 
 practical regex libraries into two kinds:
-(i)The ones  with  linear
-time guarantees like Go and Rust. The cost with them is that
+(i) The ones  with  linear
+time guarantees like Go and Rust. The downside with them is that
 they impose restrictions
-on the user input (not allowing back-references, 
-bounded repetitions cannot exceed a counter limit etc.).
+on the regular expressions (not allowing back-references, 
+bounded repetitions cannot exceed an ad hoc limit etc.).
 (ii) Those 
 that allow large bounded regular expressions and back-references
-at the expense of using a backtracking algorithm.
-They could grind to a halt
-on some very simple cases, posing a vulnerability of
-a ReDoS attack.
+at the expense of using backtracking algorithms.
+They can potentially ``grind to a halt''
+on some very simple cases, resulting 
+ReDoS attacks.
 
- 
-We would like to have regex engines that can 
-deal with the regular part (e.g.
-bounded repetitions) of regexes more
-efficiently.
-Also we want to make sure that they do it correctly.
-It turns out that such aim is not so easy to achieve.
+The proble with both approaches is the motivation for us 
+to look again at the regular expression matching problem. 
+Another motivation is that regular expression matching algorithms
+that follow the POSIX standard often contain errors and bugs 
+as we shall explain next.
+
+%We would like to have regex engines that can 
+%deal with the regular part (e.g.
+%bounded repetitions) of regexes more
+%efficiently.
+%Also we want to make sure that they do it correctly.
+%It turns out that such aim is not so easy to achieve.
  %TODO: give examples such as RE2 GOLANG 1000 restriction, rust no repetitions 
 % For example, the Rust regex engine claims to be linear, 
 % but does not support lookarounds and back-references.
@@ -894,54 +898,56 @@
 
 
 \section{Error-prone POSIX Implementations}
-When there are multiple ways of matching a string
-with a regular expression, a matcher needs to
+Very often there are multiple ways of matching a string
+with a regular expression.
+In such cases the regular expressions matcher needs to
 disambiguate.
-The standard for which particular match to pick
-is called the disambiguation strategy.
-The more intuitive strategy is called POSIX,
-which always chooses the longest initial match.
-An alternative strategy would be greedy matches,
-which always ends a sub-match as early as possible.
-The POSIX standard is widely adopted in many operating systems.
+The more widely used strategy is called POSIX,
+which roughly speaking always chooses the longest initial match.
+The POSIX strategy is widely adopted in many regular expression matchers.
 However, many implementations (including the C libraries
 used by Linux and OS X distributions) contain bugs
 or do not meet the specification they claim to adhere to.
-In some cases, they either fail to generate a lexing 
+Kuklewicz maintains a unit test repository which lists some
+problems with existing regular expression engines \ref{KuklewiczHaskell}.
+In some cases, they either fail to generate a
 result when there exists a match,
-or give results that are inconsistent with the $\POSIX$ standard.
-A concrete example would be the regex given by \cite{fowler2003}
+or give results that are inconsistent with the POSIX standard.
+A concrete example is the regex:
 \begin{center}
 	$(aba + ab + a)^* \text{and the string} ababa$
 \end{center}
-The correct $\POSIX$ match for the above would be 
-with the entire string $ababa$, 
-split into two Kleene star iterations, $[ab] [aba]$ at positions
+The correct POSIX match for the above
+is the entire string $ababa$, 
+split into two Kleene star iterations, namely $[ab], [aba]$ at positions
 $[0, 2), [2, 5)$
 respectively.
-But trying this out in regex101\parencite{regex101}
-with different language engines would yield 
-the same two fragmented matches: $[aba]$ at $[0, 3)$
+But trying this out in regex101 \parencite{regex101} \footnote{
+	regex101 is an online regular expression matcher which
+	provides API for trying out regular expression
+	engines of multiple popular programming languages like
+Java, Python, Go, etc.}
+with different engines yields
+always matches: $[aba]$ at $[0, 3)$
 and $a$ at $[4, 5)$.
 Fowler \cite{fowler2003} and Kuklewicz \cite{KuklewiczHaskell} 
 commented that most regex libraries are not
-correctly implementing the POSIX (maximum-munch)
-rule of regular expression matching.
-As Grathwohl\parencite{grathwohl2014crash} wrote,
+correctly implementing the central POSIX
+rule, called the maximum munch rule.
+Grathwohl \parencite{grathwohl2014crash} wrote,
 \begin{quote}
 	``The POSIX strategy is more complicated than the 
 	greedy because of the dependence on information about 
 	the length of matched strings in the various subexpressions.''
 \end{quote}
 %\noindent
-The implementation complexity of POSIX rules also come from
-the specification being not very clear.
+We think the implementation complexity of POSIX rules also come from
+the specification being not very precise.
 There are many informal summaries of this disambiguation
 strategy, which are often quite long and delicate.
 For example Kuklewicz \cite{KuklewiczHaskell} 
-described the POSIX rule as
+described the POSIX rule as (section 1, last paragraph):
 \begin{quote}
-	``
 	\begin{itemize}
 		\item
 regular expressions (REs) take the leftmost starting match, and the longest match starting there
@@ -958,88 +964,84 @@
 \item
 if "p" and "q" can never match the same text then "p|q" and "q|p" are equivalent, up to trivial renumbering of captured subexpressions\\
 \item
-if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string\\''
+if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string\\
 \end{itemize}
 \end{quote}
-The text above 
-is trying to capture something very precise,
-and is crying out for formalising.
+%The text above 
+%is trying to capture something very precise,
+%and is crying out for formalising.
 Ausaf et al. \cite{AusafDyckhoffUrban2016}
-are the first to fill the gap
-by not just describing such a formalised POSIX
-specification in Isabelle/HOL, but also proving
+are the first to
+give a quite simple formalised POSIX
+specification in Isabelle/HOL, and also prove 
 that their specification coincides with the 
-POSIX specification given by Okui and Suzuki \cite{Okui10} 
-which is a completely
-different characterisation.
+POSIX specification given by Okui and Suzuki \cite{Okui10}.
 They then formally proved the correctness of
 a lexing algorithm by Sulzmann and Lu \cite{Sulzmann2014}
-based on that specification.
+with regards to that specification.
+They also found that the informal POSIX
+specification by Sulzmann and Lu does not work for the correctness proof.
 
-In the next section we will very briefly
+In the next section we will briefly
 introduce Brzozowski derivatives and Sulzmann
-and Lu's algorithm, which this thesis builds on.
-We give a taste of what they 
-are like and why they are suitable for regular expression
-matching and lexing.
- 
-\section{Our Solution--Formal Specification of POSIX Matching 
+and Lu's algorithm, which the main point of this thesis builds on.
+%We give a taste of what they 
+%are like and why they are suitable for regular expression
+%matching and lexing.
+\section{Formal Specification of POSIX Matching 
 and Brzozowski Derivatives}
-Now we start with the central topic of the thesis: Brzozowski derivatives.
+%Now we start with the central topic of the thesis: Brzozowski derivatives.
 Brzozowski \cite{Brzozowski1964} first introduced the 
-concept of the \emph{derivative} in the 1960s.
+concept of a \emph{derivative} of regular expression in 1964.
 The derivative of a regular expression $r$
-with respect to a character $c$, is written as $r \backslash c$.\footnote{
-	Despite having the same name, regular expression
-	derivatives bear little similarity with the mathematical definition
-	of derivatives on functions.
-}
-It tells us what $r$ would transform into
-if we chop off the first character $c$ 
-from all strings in the language of $r$ ($L \; r$).
-To give a flavour of Brzozowski derivatives, we present
-two straightforward clauses from it:
-\begin{center}
-	\begin{tabular}{lcl}
-		$d \backslash c$     & $\dn$ & 
-		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
-$(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
-	\end{tabular}
-\end{center}
-\noindent
-The first clause says that for the regular expression
-denoting a singleton set consisting of a sinlge-character string $\{ d \}$,
-we check the derivative character $c$ against $d$,
-returning a set containing only the empty string $\{ [] \}$
-if $c$ and $d$ are equal, and the empty set $\varnothing$ otherwise.
-The second clause states that to obtain the regular expression
-representing all strings' head character $c$ being chopped off
-from $r_1 + r_2$, one simply needs to recursively take derivative
-of $r_1$ and $r_2$ and then put them together.
-
-Thanks to the definition, derivatives have the nice property
+with respect to a character $c$, is written as $r \backslash c$.
+This operation tells us what $r$ transforms into
+if we ``chop'' off the first character $c$ 
+from all strings in the language of $r$ (defined
+later as $L \; r$).
+%To give a flavour of Brzozowski derivatives, we present
+%two straightforward clauses from it:
+%\begin{center}
+%	\begin{tabular}{lcl}
+%		$d \backslash c$     & $\dn$ & 
+%		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
+%$(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
+%	\end{tabular}
+%\end{center}
+%\noindent
+%The first clause says that for the regular expression
+%denoting a singleton set consisting of a sinlge-character string $\{ d \}$,
+%we check the derivative character $c$ against $d$,
+%returning a set containing only the empty string $\{ [] \}$
+%if $c$ and $d$ are equal, and the empty set $\varnothing$ otherwise.
+%The second clause states that to obtain the regular expression
+%representing all strings' head character $c$ being chopped off
+%from $r_1 + r_2$, one simply needs to recursively take derivative
+%of $r_1$ and $r_2$ and then put them together.
+Derivatives have the property
 that $s \in L \; (r\backslash c)$ if and only if 
-$c::s \in L \; r$.
+$c::s \in L \; r$ where $::$ stands for list prepending.
 %This property can be used on regular expressions
 %matching and lexing--to test whether a string $s$ is in $L \; r$,
 %one simply takes derivatives of $r$ successively with
 %respect to the characters (in the correct order) in $s$,
 %and then test whether the empty string is in the last regular expression.
-Derivatives give a simple solution
-to the problem of matching and lexing a string $s$ with a regular
+With this derivatives give a simple solution
+to the problem of matching a string $s$ with a regular
 expression $r$: if the derivative of $r$ w.r.t.\ (in
 succession) all the characters of the string matches the empty string,
 then $r$ matches $s$ (and {\em vice versa}).  
-
-This makes formally reasoning about these properties such
-as correctness and complexity smooth and intuitive.
-In fact, there has already been several mechanised proofs about them,
-for example the one by Owens and Slind \cite{Owens2008} in HOL4,
+%This makes formally reasoning about these properties such
+%as correctness and complexity smooth and intuitive.
+There had been several mechanised proofs about this property in various theorem
+provers,
+for example one by Owens and Slind \cite{Owens2008} in HOL4,
 another one by Krauss and Nipkow \cite{Nipkow98} in Isabelle/HOL, and
 yet another in Coq by Coquand and Siles \cite{Coquand2012}.
 
-In addition, one can extend the clauses to bounded repetitions
-``for free'':
+In addition, one can extend derivatives to bounded repetitions
+relatively straightforwardly. For example, the derivative for 
+this can be defined as:
 \begin{center}
 	\begin{tabular}{lcl}
 		$r^{\{n\}} \backslash c$     & $\dn$ & $r \backslash c \cdot
@@ -1047,19 +1049,20 @@
 	\end{tabular}
 \end{center}
 \noindent
-And experimental results suggest that  unlike DFA-based solutions,
-this derivatives can support 
-bounded regular expressions with large counters
+Experimental results suggest that  unlike DFA-based solutions
+for bounded regular expressions,
+derivatives can cope
+large counters
 quite well.
 
 There has also been 
 extensions to other constructs.
 For example, Owens et al include the derivatives
-for \emph{NOT} regular expressions, which is
+for the \emph{NOT} regular expression, which is
 able to concisely express C-style comments of the form
-$/* \ldots */$.
-Another extension for derivatives would be
-regular expressions with look-aheads, done by
+$/* \ldots */$ (see \cite{Owens2008}).
+Another extension for derivatives is
+regular expressions with look-aheads, done
 by Miyazaki and Minamide
 \cite{Takayuki2019}.
 %We therefore use Brzozowski derivatives on regular expressions 
@@ -1079,7 +1082,7 @@
 \cite{Almeidaetal10}, \cite{Berglund14}, \cite{Berglund18},
 \cite{Chen12} and \cite{Coquand2012}
 to name a few), despite being buried in the ``sands of time'' \cite{Owens2008}
-after they were first published.
+after they were first published by Brzozowski.
 
 
 However, there are two difficulties with derivative-based matchers:
@@ -1100,35 +1103,37 @@
 is made for the Verbatim \cite{Verbatim}
 %TODO: give references
 lexer, which calculates POSIX matches and is based on derivatives.
-They formalized the correctness of the lexer, but not the complexity.
-In the performance evaluation section, they simply analyzed the run time
+They formalized the correctness of the lexer, but not their complexity result.
+In the performance evaluation section, they analyzed the run time
 of matching $a$ with the string 
 \begin{center}
-	$\underbrace{a \ldots a}_{\text{n a's}}$
+	$\underbrace{a \ldots a}_{\text{n a's}}$.
 \end{center}
-and concluded that the algorithm is quadratic in terms of input length.
+\noindent
+They concluded that the algorithm is quadratic in terms of 
+the length of the input string.
 When we tried out their extracted OCaml code with our example $(a+aa)^*$,
-the time it took to lex only 40 $a$'s was 5 minutes.
+the time it took to match a string of 40 $a$'s was approximately 5 minutes.
 
 
 \subsection{Sulzmann and Lu's Algorithm}
 Sulzmann and Lu~\cite{Sulzmann2014} overcame the first 
-difficulty by cleverly extending Brzozowski's matching
+problem with the yes/no answer 
+by cleverly extending Brzozowski's matching
 algorithm. Their extended version generates additional information on
 \emph{how} a regular expression matches a string following the POSIX
 rules for regular expression matching. They achieve this by adding a
 second ``phase'' to Brzozowski's algorithm involving an injection
-function simplification of internal data structures 
-eliminating the exponential behaviours.
-In an earlier work, Ausaf et al provided the formal
+function.
+In earlier work, Ausaf et al provided the formal
 specification of what POSIX matching means and proved in Isabelle/HOL
 the correctness
-of Sulzmann and Lu's extended algorithm accordingly
+of this extended algorithm accordingly
 \cite{AusafDyckhoffUrban2016}.
 
 The version of the algorithm proven correct 
-suffers from the
-second difficulty though, where the internal derivatives can
+suffers heavily from a 
+second difficulty, where the internal derivatives can
 grow to arbitrarily big sizes. 
 For example if we start with the
 regular expression $(a+aa)^*$ and take
@@ -1147,23 +1152,22 @@
 \end{center}
  
 \noindent where after around 35 steps we run out of memory on a
-typical computer (we shall define in the next chapter 
-the precise details of our
-regular expressions and the derivative operation).  Clearly, the
+typical computer.  Clearly, the
 notation involving $\ZERO$s and $\ONE$s already suggests
 simplification rules that can be applied to regular regular
 expressions, for example $\ZERO{}\,r \Rightarrow \ZERO$, $\ONE{}\,r
 \Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow
-r$. While such simple-minded simplifications have been proved in our
-earlier work to preserve the correctness of Sulzmann and Lu's
+r$. While such simple-minded simplifications have been proved in 
+the work by Ausaf et al. to preserve the correctness of Sulzmann and Lu's
 algorithm \cite{AusafDyckhoffUrban2016}, they unfortunately do
 \emph{not} help with limiting the growth of the derivatives shown
 above: the growth is slowed, but the derivatives can still grow rather
 quickly beyond any finite bound.
 
-Sulzmann and Lu overcome this ``growth problem'' in a second algorithm
-\cite{Sulzmann2014} where they introduce bit-coded
-regular expressions. In this version, POSIX values are
+Therefore we want to look in this thesis at a second
+algorithm by Sulzmann and Lu where they
+overcame this ``growth problem'' \cite{Sulzmann2014}.
+In this version, POSIX values are 
 represented as bit sequences and such sequences are incrementally generated
 when derivatives are calculated. The compact representation
 of bit sequences and regular expressions allows them to define a more
@@ -1184,20 +1188,16 @@
   extensively [..] but yet
   have to work out all proof details.'' \cite[Page 14]{Sulzmann2014}
 \end{quote}  
-Ausaf and Urban were able to back this correctness claim with
-a formal proof.
-
-However a faster formally verified 
-lexing program with the optimisations
-mentioned by Sulzmann and Lu's second algorithm
-is still missing.
-As they stated,
+Ausaf and Urban made some initial progress towards the 
+full correctness proof but still had to leave out the optimisation
+Sulzmann and Lu proposed.
+Ausaf  wrote \cite{Ausaf},
   \begin{quote}\it
 ``The next step would be to implement a more aggressive simplification procedure on annotated regular expressions and then prove the corresponding algorithm generates the same values as blexer. Alas due to time constraints we are unable to do so here.''
 \end{quote}  
 This thesis implements the aggressive simplifications envisioned
 by Ausaf and Urban,
-together with a formal proof of the correctness with those simplifications.
+together with a formal proof of the correctness of those simplifications.
 
 
 One of the most recent work in the context of lexing
@@ -1239,56 +1239,36 @@
 and extension to
 the bounded repetitions construct with the correctness and finiteness property
 maintained.
- \end{itemize}
-
-
+\end{itemize}
+\noindent
 With a formal finiteness bound in place,
 we can greatly reduce the attack surface of servers in terms of ReDoS attacks.
 Further improvements to the algorithm with an even stronger version of 
-simplification is made.
-Thanks to our theorem-prover-friendly approach,
-we believe that 
-this finiteness bound can be improved to a bound
-linear to input and
-cubic to the regular expression size using a technique by
-Antimirov\cite{Antimirov95}.
-Once formalised, this would be a guarantee for the absence of all super-linear behavious.
-We are working out the
-details.
-
- 
-To our best knowledge, no lexing libraries using Brzozowski derivatives
-have similar complexity-related bounds, 
-and claims about running time are usually speculative and backed by empirical
-evidence on a few test cases.
-If a matching or lexing algorithm
-does not come with certain basic complexity related 
-guarantees (for examaple the internal data structure size
-does not grow indefinitely), 
-then they cannot claim with confidence having solved the problem
-of catastrophic backtracking.
+simplification can be made.
 
 
 
 
 
 \section{Structure of the thesis}
-In chapter 2 \ref{Inj} we will introduce the concepts
+In chapter \ref{Inj} we will introduce the concepts
 and notations we 
-use for describing the lexing algorithm by Sulzmann and Lu,
-and then give the lexing algorithm.
-We will give its variant in \ref{Bitcoded1}.
-Then we illustrate in \ref{Bitcoded2}
-how the algorithm without bitcodes falls short for such aggressive 
-simplifications and therefore introduce our version of the
- bit-coded algorithm and 
+use for describing regular expressions and derivatives,
+and the first version of their lexing algorithm without bitcodes (including 
+its correctness proof).
+We will give their second lexing algorithm with bitcodes in \ref{Bitcoded1}
+together with the correctness proof by Ausaf and Urban.
+Then we illustrate in chapter \ref{Bitcoded2}
+how Sulzmann and Lu's
+simplifications fail to simplify. We therefore introduce our version of the
+algorithm with simplification and 
 its correctness proof .  
-In \ref{Finite} we give the second guarantee
+In chapter \ref{Finite} we give the second guarantee
 of our bitcoded algorithm, that is a finite bound on the size of any 
-regex's derivatives.
-In \ref{Cubic} we discuss stronger simplifications to improve the finite bound
-in \ref{Finite} to a polynomial one, and demonstrate how one can extend the
-algorithm to include constructs such as bounded repetitions and negations.
+regular expression's derivatives.
+In chapter \ref{Cubic} we discuss stronger simplification rules which 
+improve the finite bound to a polynomial bound, and also show how one can extend the
+algorithm to include bounded repetitions. %and the NOT regular expression.